Our research focuses on performing Exploratory Data Analysis (EDA) on Google Play Store apps to uncover patterns, trends, and insights regarding app characteristics, user behavior, and installation patterns. We are trying to see how app popularity, defined as the number of installs with high reviews and ratings, is impacted by categories, last updated, app sizes, version, and other factors.
“What is the impact of content rating, required Android version, app category, size, and pricing on predicting app success in terms of positive ratings and high user reviews, as well as the number of installs, using data from Google Play Store apps from 2010 to 2018?”
Specific: The question clearly defines the variables (content rating, required Android version, app category, size, pricing) and the outcomes (positive ratings, high user reviews, number of installs).
Measurable: The outcomes (positive ratings, high user reviews, number of installs) are quantifiable.
Achievable: Given the availability of Google Play Store data from 2010 to 2018, the analysis is feasible.
Relevant: The question addresses a significant issue in the app development and marketing industry: predicting app success.
Time-specific: The timeframe (2010-2018) is clearly defined.
data_apps <- data.frame(read.csv("googleplaystore.csv"))
#Checking the structure of the data
str(data_apps)
## 'data.frame': 10841 obs. of 13 variables:
## $ App : chr "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
## $ Category : chr "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : chr "159" "967" "87510" "215644" ...
## $ Size : chr "19M" "14M" "8.7M" "25M" ...
## $ Installs : chr "10,000+" "500,000+" "5,000,000+" "50,000,000+" ...
## $ Type : chr "Free" "Free" "Free" "Free" ...
## $ Price : chr "0" "0" "0" "0" ...
## $ Content.Rating: chr "Everyone" "Everyone" "Everyone" "Teen" ...
## $ Genres : chr "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
## $ Last.Updated : chr "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
## $ Current.Ver : chr "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
## $ Android.Ver : chr "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
#Display all the duplicated Apps
duplicate_apps <- aggregate(App ~ ., data = data_apps, FUN = length)
duplicate_apps <- duplicate_apps[duplicate_apps$App > 1, ]
duplicate_apps <- duplicate_apps[order(-duplicate_apps$App), ]
#View(duplicate_apps)
#print(duplicate_apps)
print(paste("Number of duplicated Apps:",nrow(duplicate_apps)))
## [1] "Number of duplicated Apps: 404"
#Removing Na values and duplicates
data_clean <- data_apps[!is.na(data_apps$App), ]
data_clean <- data_clean[!duplicated(data_clean$App), ]
#(After removing the duplicates) Unique values
unique_apps <- length(unique(data_clean$App))
print(paste("Number of unique apps after removing the duplicates:", unique_apps))
## [1] "Number of unique apps after removing the duplicates: 9660"
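The two counts above measure different things: the first looks for rows that share the same values across the other columns, whereas `duplicated(data_clean$App)` drops any repeated app name even when other fields differ. A minimal sketch on a hypothetical toy data frame (`toy` is not part of the dataset) illustrates the difference:

```r
# Hypothetical toy data: "A" appears twice as an exact copy,
# "B" appears twice with different review counts
toy <- data.frame(
  App     = c("A", "A", "B", "B", "C"),
  Reviews = c(10, 10, 20, 25, 30)
)

# Exact duplicates: every column matches an earlier row
full_dups <- sum(duplicated(toy))

# Name duplicates: only the App column matches an earlier row
name_dups <- sum(duplicated(toy$App))

# Keep the first occurrence of each app name, as in the cleaning step above
toy_clean <- toy[!duplicated(toy$App), ]

print(c(full_row = full_dups, by_name = name_dups))
print(nrow(toy_clean))
```

Deduplicating by name keeps whichever row happens to come first, so if the copies differ (for example in review count), it may be worth sorting before dropping them.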
Duplicate App Analysis:
#DataFrame includes unique values and Na for all variables in data after removing duplicates
unique_values_list <- lapply(data_clean, unique)
unique_counts_list <- lapply(data_clean, function(col) length(unique(col)))
null_counts_list <- lapply(data_clean, function(col) sum(is.na(col)))
unique_df <- data.frame(
Unique_Values = sapply(unique_values_list, function(x) paste(x, collapse = ", ")),
Unique_Counts = unlist(unique_counts_list),
Null_Counts = unlist(null_counts_list)
)
typeof(data_apps$Price)
## [1] "character"
Conversion of Price to numeric is required. Paid apps carry a ‘$’ symbol in their price, which must be checked for and removed before conversion.
#To check if there is dollar symbol present
#data_clean$Price[]
# Remove dollar symbols and convert to numeric
data_clean$Price <- as.numeric(gsub("\\$", "", data_clean$Price))
#Recheck for dollar symbol
#data_clean$Price[]
All the dollar symbols are removed successfully.
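One way to verify the conversion, sketched here on a small hypothetical vector (`raw_price` and its values are assumptions, not rows from the dataset), is to confirm that no dollar signs remain and to list any entries that fail to convert:

```r
# Hypothetical raw price strings in the same style as the dataset
raw_price <- c("0", "$4.99", "$0.99", "Everyone")

# Strip the dollar sign, then convert; non-numeric strings become NA
stripped  <- gsub("$", "", raw_price, fixed = TRUE)
price_num <- suppressWarnings(as.numeric(stripped))

# Confirm no dollar signs remain and list the entries that failed to convert
any_dollar <- any(grepl("$", stripped, fixed = TRUE))
failed     <- raw_price[is.na(price_num)]

print(any_dollar)
print(failed)
```

Entries that fail to convert show up as `NA` in the numeric column, which is consistent with the single `NA` reported by `summary()`.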
# Summary statistics for price
summary(data_clean$Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 1.099 0.000 400.000 1
From the unique_df, there is a missing value present in the Price column. Let’s handle it!
#Checking for missing values in Price
missing_na <- is.na(data_clean$Price)
missing_blank <- data_clean$Price == ""
sum(missing_na)
## [1] 1
sum(missing_blank, na.rm = TRUE)
## [1] 0
# Remove row where Price is NA or blank
data_clean <- data_clean[!is.na(data_clean$Price) & data_clean$Price != "", ]
One row (#10473) was removed: this app does not have a category name, so it is not relevant to our analysis.
#Recheck for missing values
summary(data_clean$Price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 1.099 0.000 400.000
Missing values removed successfully. (Price)
#Count Plot for the Price distribution
ggplot(data_clean, aes(x=Price)) +
geom_histogram(binwidth=2, fill="pink", color="black") +
xlim(0, 500) + ylim(0, 500) +
labs(title="Price Distribution", x="Price", y="Frequency") +
theme_minimal()
The data is highly skewed as there are many zero price entries.
# Boxplot for the same
ggplot(data_clean, aes(y=Price)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 1, fill="pink", color="black") +
labs(title="Price Boxplot", y="Price") +
theme_minimal()
outlierKD2 <- function(df, var, rm = FALSE, boxplt = FALSE, histogram = TRUE, qqplt = FALSE) {
dt <- df # Duplicate the dataframe for potential alteration
var_name <- eval(substitute(var), eval(dt))
na1 <- sum(is.na(var_name))
m1 <- mean(var_name, na.rm = TRUE)
colTotal <- boxplt + histogram + qqplt # Calculate the total number of charts to be displayed
par(mfrow = c(2, max(2, colTotal)), oma = c(0, 0, 3, 0)) # Adjust layout for plots
# Q-Q plot of the original data (outliers included)
if (qqplt) {
qqnorm(var_name, main="Q-Q plot with Outliers")
qqline(var_name)
}
# Histogram of the original data (outliers included)
if (histogram) {
hist(var_name,main = "Histogram with Outliers", xlab = NA, ylab = NA)
}
# Box plot of the original data (outliers included)
if (boxplt) {
boxplot(var_name, main= "Box Plot with Outliers")
}
# Identify outliers with the boxplot (1.5 * IQR) rule and set them to NA
outlier <- boxplot.stats(var_name)$out
mo <- mean(outlier)
var_name <- ifelse(var_name %in% outlier, NA, var_name)
# Q-Q plot without outliers
if (qqplt) {
qqnorm(var_name, main="Q-Q plot without Outliers")
qqline(var_name)
}
# Histogram without outliers
if (histogram) {
hist(var_name, main = "Histogram without Outliers", xlab = NA, ylab = NA)
}
# Box plot without outliers
if (boxplt) {
boxplot(var_name, main = "Box Plot without Outliers")
}
# Add the title for the overall plot section if any plots are displayed
if (colTotal > 0) {
title("Outlier Check", outer = TRUE)
na2 <- sum(is.na(var_name))
cat("Outliers identified:", na2 - na1, "\n")
cat("Proportion (%) of outliers:", round((na2 - na1) / sum(!is.na(var_name)) * 100, 1), "\n")
cat("Mean of the outliers:", round(mo, 2), "\n")
cat("Mean without removing outliers:", round(m1, 2), "\n")
cat("Mean if we remove outliers:", round(mean(var_name, na.rm = TRUE), 2), "\n")
}
# Remove outliers if `rm = TRUE`
if (rm) {
dt[as.character(substitute(var))] <- invisible(var_name)
cat("Outliers successfully removed", "\n")
return(invisible(dt))
} else {
cat("Nothing changed", "\n")
return(invisible(df))
}
}
#outlier function is defined in the previous chunk of code.
outlier_check_price = outlierKD2(data_clean, Price, rm = FALSE, boxplt = TRUE, qqplt = TRUE)
## Outliers identified: 756
## Proportion (%) of outliers: 8.5
## Mean of the outliers: 14.05
## Mean without removing outliers: 1.1
## Mean if we remove outliers: 0
## Nothing changed
The price values in the dataset, including both typical and extreme values, are valid observations for our analysis. As such, removing these outliers may not be beneficial for our study.
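The 756 flagged values are what the boxplot rule would be expected to produce when more than three quarters of the prices are 0: the interquartile range is then 0, so every non-zero price falls outside the whiskers. A minimal sketch with hypothetical prices (the vector `price` is an assumption for illustration):

```r
# Hypothetical prices: 17 free apps and 3 paid ones
price <- c(rep(0, 17), 0.99, 2.99, 4.99)

# boxplot.stats() flags values more than 1.5 * IQR beyond the hinges
stats     <- boxplot.stats(price)
iqr_width <- stats$stats[4] - stats$stats[2]   # upper hinge minus lower hinge
flagged   <- stats$out

print(iqr_width)
print(flagged)
```

In this setting "outlier" effectively means "paid", which supports keeping these rows rather than removing them.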
#To check the value ranges
table(data_clean$Price)
##
## 0 0.99 1 1.04 1.2 1.26 1.29 1.49 1.5 1.59 1.61
## 8903 145 3 1 1 1 1 46 1 1 1
## 1.7 1.75 1.76 1.96 1.97 1.99 2 2.49 2.5 2.56 2.59
## 2 1 1 1 1 73 3 25 1 1 1
## 2.6 2.9 2.95 2.99 3.02 3.04 3.08 3.28 3.49 3.61 3.88
## 1 1 1 124 1 1 1 1 7 1 1
## 3.9 3.95 3.99 4.29 4.49 4.59 4.6 4.77 4.8 4.84 4.85
## 1 1 57 1 9 1 1 1 1 1 1
## 4.99 5 5.49 5.99 6.49 6.99 7.49 7.99 8.49 8.99 9
## 70 1 5 26 5 11 2 7 2 5 1
## 9.99 10 10.99 11.99 12.99 13.99 14 14.99 15.46 15.99 16.99
## 19 2 2 3 4 2 1 9 1 1 2
## 17.99 18.99 19.4 19.9 19.99 24.99 25.99 28.99 29.99 30.99 33.99
## 2 1 1 1 5 3 1 1 5 1 1
## 37.99 39.99 46.99 74.99 79.99 89.99 109.99 154.99 200 299.99 379.99
## 1 2 1 1 1 1 1 1 1 1 1
## 389.99 394.99 399.99 400
## 1 1 12 1
As already mentioned, there are 8903 free apps (apps with a price of 0).
table(data_clean$Type)
##
## Free Paid
## 8902 756
The Price column shows 8903 free apps, but the Type column reports only 8902, so one row is labelled inconsistently. Let’s check!
#Missing values
print(paste("Missing values:",sum(is.na(data_clean$Type))))
## [1] "Missing values: 0"
data_clean[is.na(data_clean$Type), ]
## [1] App Category Rating Reviews Size
## [6] Installs Type Price Content.Rating Genres
## [11] Last.Updated Current.Ver Android.Ver
## <0 rows> (or 0-length row.names)
# Replace NaN or missing values in the Type column with "Free"
data_clean$Type[is.na(data_clean$Type)] <- "Free"
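Row 9150 has no valid Type: the value is stored as the string “NaN” rather than as a true missing value, so is.na() reports 0 missing values and the replacement above leaves the row unchanged. As its price is 0, the app should be treated as “Free”.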
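A minimal sketch of how such a row could be caught, assuming the invalid value is stored as the literal string `"NaN"` (the vectors `type` and `price` below are hypothetical):

```r
# Hypothetical Type values; "NaN" here is a character string, not a missing value
type  <- c("Free", "Paid", "NaN", "Free")
price <- c(0, 2.99, 0, 0)

# is.na() does not treat the string "NaN" as missing
n_true_na <- sum(is.na(type))

# Match the literal string as well, then derive the label from the price
bad <- is.na(type) | type == "NaN"
type[bad] <- ifelse(price[bad] == 0, "Free", "Paid")

print(n_true_na)
print(table(type))
```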
ggplot(data_clean, aes(x = Type)) +
geom_bar(fill = "pink", color = "black") +
labs(title = "Distribution of App Types (Free vs Paid)", x = "Type", y = "Count") +
theme_minimal()
As it is clear, there are more free apps.
data_clean$Type <- as.factor(data_clean$Type)
summary_by_type <- data.frame(
Type = levels(data_clean$Type),
Min_Price = tapply(data_clean$Price, data_clean$Type, min, na.rm = TRUE),
Max_Price = tapply(data_clean$Price, data_clean$Type, max, na.rm = TRUE),
Mean_Price = tapply(data_clean$Price, data_clean$Type, mean, na.rm = TRUE),
Median_Price = tapply(data_clean$Price, data_clean$Type, median, na.rm = TRUE)
)
print(summary_by_type)
## Type Min_Price Max_Price Mean_Price Median_Price
## Free Free 0.00 0 0.00000 0.00
## NaN NaN 0.00 0 0.00000 0.00
## Paid Paid 0.99 400 14.04515 2.99
ggplot(data_clean, aes(x = Type, y = Price, fill = Type)) +
geom_boxplot() +
labs(title = "Price Distribution by App Type",
x = "App Type",
y = "Price ($)") +
theme_minimal()
ggplot(data_clean, aes(x = Price, fill = Type)) +
geom_histogram(binwidth = 60, alpha = 0.7, position = "identity") +
facet_wrap(~ Type) +
labs(title = "Price Distribution by App Type",
x = "Price ($)",
y = "Count") +
theme_minimal()
Upon analyzing the price distribution across app types, we found that some values in the Type column do not accurately represent the app prices (see the plot above). Since we can rely fully on the Price values for our analysis, the Type column seems unnecessary.
Hence, removing the Type column…
#Using subset function
data_clean <- subset(data_clean, select = -Type)
#After removing the Type column and duplicated values
str(data_clean)
## 'data.frame': 9659 obs. of 12 variables:
## $ App : chr "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
## $ Category : chr "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : chr "159" "967" "87510" "215644" ...
## $ Size : chr "19M" "14M" "8.7M" "25M" ...
## $ Installs : chr "10,000+" "500,000+" "5,000,000+" "50,000,000+" ...
## $ Price : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Content.Rating: chr "Everyone" "Everyone" "Everyone" "Teen" ...
## $ Genres : chr "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
## $ Last.Updated : chr "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
## $ Current.Ver : chr "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
## $ Android.Ver : chr "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
head(data_clean)
## App Category Rating
## 1 Photo Editor & Candy Camera & Grid & ScrapBook ART_AND_DESIGN 4.1
## 2 Coloring book moana ART_AND_DESIGN 3.9
## 3 U Launcher Lite – FREE Live Cool Themes, Hide Apps ART_AND_DESIGN 4.7
## 4 Sketch - Draw & Paint ART_AND_DESIGN 4.5
## 5 Pixel Draw - Number Art Coloring Book ART_AND_DESIGN 4.3
## 6 Paper flowers instructions ART_AND_DESIGN 4.4
## Reviews Size Installs Price Content.Rating Genres
## 1 159 19M 10,000+ 0 Everyone Art & Design
## 2 967 14M 500,000+ 0 Everyone Art & Design;Pretend Play
## 3 87510 8.7M 5,000,000+ 0 Everyone Art & Design
## 4 215644 25M 50,000,000+ 0 Teen Art & Design
## 5 967 2.8M 100,000+ 0 Everyone Art & Design;Creativity
## 6 167 5.6M 50,000+ 0 Everyone Art & Design
## Last.Updated Current.Ver Android.Ver
## 1 January 7, 2018 1.0.0 4.0.3 and up
## 2 January 15, 2018 2.0.0 4.0.3 and up
## 3 August 1, 2018 1.2.4 4.0.3 and up
## 4 June 8, 2018 Varies with device 4.2 and up
## 5 June 20, 2018 1.1 4.4 and up
## 6 March 26, 2017 1.0 2.3 and up
The Type column is successfully removed.
Now that the cleaning and analysis of App and Price are done, let’s proceed with Installs.
#clean installations
clean_installs <- function(Installs) {
Installs <- gsub("\\+", "", Installs) # Remove the '+' sign
Installs <- gsub(",", "", Installs) # Remove the commas
return(as.numeric(Installs)) # Convert to numeric
}
data_clean$Installs <- sapply(data_clean$Installs, clean_installs)
nan_rows <- sapply(data_clean[, c("Size", "Installs")], function(x) any(is.nan(x)))
# nan_rows flags columns (not rows) that contain NaN; selecting by it returns no columns, so neither Size nor Installs has any
data_clean[,nan_rows]
## data frame with 0 columns and 9659 rows
datatable((data_clean), options = list(scrollX = TRUE ))
data_clean1 <- data_clean %>%
mutate(Rating = ifelse(is.na(Rating), mean(Rating, na.rm = TRUE), Rating))
# Step 1: Identify the unique values in the 'Installs' column
unique_values <- unique(data_clean1$Installs)
# Display the unique values
print(unique_values)
## [1] 1e+04 5e+05 5e+06 5e+07 1e+05 5e+04 1e+06 1e+07 5e+03 1e+08 1e+09 1e+03
## [13] 5e+08 5e+01 1e+02 5e+02 1e+01 1e+00 5e+00 0e+00
# Installs is already numeric after clean_installs(), so sort the unique values directly.
# Stripping non-digits from the printed form would misread values shown in
# scientific notation (e.g. "5e+05" would become 505).
sorted_values <- sort(unique_values)
# Create a new data frame to store the factor levels
data_clean1_factor <- data_clean1 # Assuming you want to keep the original data intact
data_clean1_factor$Installs <- factor(data_clean1$Installs, levels = sorted_values)
# Create a bar plot with the ordered factor
ggplot(data_clean1_factor, aes(x = Installs)) +
geom_bar() +
xlab("Installs") +
ylab("Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # Rotate x-axis labels for readability
ggtitle("Distribution of App Installs")
Note: the cleaning of Ratings and Reviews is covered in a later section.
# Scatter plot for Installs vs Reviews
ggplot(data_clean1_factor, aes(x = Reviews, y = Installs)) +
geom_point(color = "blue", alpha = 0.5) +
labs(title = "Scatter Plot of Installs vs Reviews",
x = "Number of Reviews",
y = "Number of Installs") +
theme_minimal()
# Log-transform the Installs
data_clean$log_Installs <- log(data_clean$Installs)
# Scatter plot of log-transformed Installs vs. Rating
ggplot(data_clean, aes(x = log_Installs, y = Rating)) +
geom_point(color = "blue", alpha = 0.6) +
geom_smooth(method = "lm", color = "red", se = FALSE) + # Add a regression line
labs(title = "Log-Transformed Installs vs. Rating",
x = "Log(Installs)",
y = "Rating") +
theme_minimal()
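Apps with 0 installs give `log(0) = -Inf`, which then propagates into summary statistics of `log_Installs`. One hedged alternative is `log1p()`, sketched here on hypothetical install counts:

```r
# Hypothetical install counts, including an app with zero installs
installs <- c(0, 10, 1000, 1e6)

log_plain <- log(installs)     # log(0) is -Inf
log_shift <- log1p(installs)   # log(1 + x), so log1p(0) is 0

print(mean(log_plain))   # -Inf because of the zero entry
print(mean(log_shift))   # finite
```

For large install counts the two transforms are nearly identical, so the shape of the plot would change little while the mean and minimum stay finite.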
Now that we are done with the cleaning and analysis of the Installs and Size variables, let’s proceed with Ratings and Reviews!
## chr [1:9659] "159" "967" "87510" "215644" "967" "167" "178" "36815" ...
## num [1:9659] 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
As we can see, the Reviews column is stored as character; converting it to numeric allows further analysis.
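A minimal sketch of such a conversion, using a hypothetical vector (`reviews_chr`, with `"3.0M"` standing in for a malformed entry) rather than the actual column:

```r
# Hypothetical review counts stored as strings
reviews_chr <- c("159", "967", "87510", "3.0M")

# Convert to numeric; anything that is not a plain number becomes NA
reviews_num <- suppressWarnings(as.numeric(reviews_chr))

# Inspect entries that failed to convert before deciding how to handle them
print(reviews_chr[is.na(reviews_num)])
print(reviews_num)
```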
## 'data.frame': 9659 obs. of 13 variables:
## $ App : chr "Photo Editor & Candy Camera & Grid & ScrapBook" "Coloring book moana" "U Launcher Lite – FREE Live Cool Themes, Hide Apps" "Sketch - Draw & Paint" ...
## $ Category : chr "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
## $ Rating : num 4.1 3.9 4.7 4.5 4.3 4.4 3.8 4.1 4.4 4.7 ...
## $ Reviews : num 159 967 87510 215644 967 ...
## $ Size : chr "19M" "14M" "8.7M" "25M" ...
## $ Installs : num 1e+04 5e+05 5e+06 5e+07 1e+05 5e+04 5e+04 1e+06 1e+06 1e+04 ...
## $ Price : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Content.Rating: chr "Everyone" "Everyone" "Everyone" "Teen" ...
## $ Genres : chr "Art & Design" "Art & Design;Pretend Play" "Art & Design" "Art & Design" ...
## $ Last.Updated : chr "January 7, 2018" "January 15, 2018" "August 1, 2018" "June 8, 2018" ...
## $ Current.Ver : chr "1.0.0" "2.0.0" "1.2.4" "Varies with device" ...
## $ Android.Ver : chr "4.0.3 and up" "4.0.3 and up" "4.0.3 and up" "4.2 and up" ...
## $ log_Installs : num 9.21 13.12 15.42 17.73 11.51 ...
| | App | Category | Rating | Reviews | Size | Installs | Price | Content.Rating | Genres | Last.Updated | Current.Ver | Android.Ver | log_Installs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | Length:9659 | Length:9659 | Min. :1.000 | Min. : 0 | Length:9659 | Min. :0.000e+00 | Min. : 0.000 | Length:9659 | Length:9659 | Length:9659 | Length:9659 | Length:9659 | Min. : -Inf |
| Q1 | Class :character | Class :character | 1st Qu.:4.000 | 1st Qu.: 25 | Class :character | 1st Qu.:1.000e+03 | 1st Qu.: 0.000 | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 6.908 |
| Median | Mode :character | Mode :character | Median :4.300 | Median : 967 | Mode :character | Median :1.000e+05 | Median : 0.000 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median :11.513 |
| Mean | NA | NA | Mean :4.173 | Mean : 216593 | NA | Mean :7.778e+06 | Mean : 1.099 | NA | NA | NA | NA | NA | Mean : -Inf |
| Q3 | NA | NA | 3rd Qu.:4.500 | 3rd Qu.: 29401 | NA | 3rd Qu.:1.000e+06 | 3rd Qu.: 0.000 | NA | NA | NA | NA | NA | 3rd Qu.:13.816 |
| Max | NA | NA | Max. :5.000 | Max. :78158306 | NA | Max. :1.000e+09 | Max. :400.000 | NA | NA | NA | NA | NA | Max. :20.723 |
| NA | NA | NA | NA’s :1463 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA |
There are 1463 missing values in rating.
df_na_rating <- data_clean %>% filter(is.na(Rating))
# Group by Category and count the number of NA ratings for each category
na_rating_distribution <- df_na_rating %>%
group_by(Category) %>%
summarise(count = n()) %>%
arrange(desc(count))
ggplot(na_rating_distribution, aes(x = reorder(Category, -count), y = count)) +
geom_bar(stat = "identity", fill = "steelblue") +
geom_text(aes(label = count),
position = position_stack(vjust = 0.5), # Center the text within the bars
color = "white", size = 3) + # Adjust text color and size
coord_flip() +
theme_minimal() +
labs(title = "Distribution of NA Ratings by Category",
x = "Category",
y = "Count of NA Ratings") +
theme(axis.text.y = element_text(size = 8))
As can be observed, the Family category has the most missing ratings. Rather than dropping these rows, we replace the missing values with the overall mean rating.
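A finer-grained option would be to impute each missing rating with the mean of its own category instead of a single overall mean. A sketch in base R on a hypothetical toy data frame (`toy` and `Rating_filled` are illustrative names):

```r
# Hypothetical toy data with missing ratings in two categories
toy <- data.frame(
  Category = c("FAMILY", "FAMILY", "FAMILY", "TOOLS", "TOOLS"),
  Rating   = c(4.0, NA, 4.4, 3.0, NA)
)

# ave() returns, for every row, the mean rating of that row's category
cat_mean <- ave(toy$Rating, toy$Category,
                FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing ratings with their category mean
toy$Rating_filled <- ifelse(is.na(toy$Rating), cat_mean, toy$Rating)

print(toy)
```

A category whose ratings are all missing would yield `NaN` here, so a fallback to the overall mean would still be needed in that case.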
# Method 1: Replace NA in Ratings with Overall Mean
data_clean1 <- data_clean %>%
mutate(Rating = ifelse(is.na(Rating), mean(Rating, na.rm = TRUE), Rating))
xkablesummary(data_clean1)
| | App | Category | Rating | Reviews | Size | Installs | Price | Content.Rating | Genres | Last.Updated | Current.Ver | Android.Ver | log_Installs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | Length:9659 | Length:9659 | Min. :1.000 | Min. : 0 | Length:9659 | Min. :0.000e+00 | Min. : 0.000 | Length:9659 | Length:9659 | Length:9659 | Length:9659 | Length:9659 | Min. : -Inf |
| Q1 | Class :character | Class :character | 1st Qu.:4.000 | 1st Qu.: 25 | Class :character | 1st Qu.:1.000e+03 | 1st Qu.: 0.000 | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 6.908 |
| Median | Mode :character | Mode :character | Median :4.200 | Median : 967 | Mode :character | Median :1.000e+05 | Median : 0.000 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median :11.513 |
| Mean | NA | NA | Mean :4.173 | Mean : 216593 | NA | Mean :7.778e+06 | Mean : 1.099 | NA | NA | NA | NA | NA | Mean : -Inf |
| Q3 | NA | NA | 3rd Qu.:4.500 | 3rd Qu.: 29401 | NA | 3rd Qu.:1.000e+06 | 3rd Qu.: 0.000 | NA | NA | NA | NA | NA | 3rd Qu.:13.816 |
| Max | NA | NA | Max. :5.000 | Max. :78158306 | NA | Max. :1.000e+09 | Max. :400.000 | NA | NA | NA | NA | NA | Max. :20.723 |
Now there are no missing values in Rating.
breaks = seq(15,20,by = 1)
frequency_table = table(data_clean1$Rating)
frequency_table
##
## 1 1.2 1.4 1.5
## 16 1 3 3
## 1.6 1.7 1.8 1.9
## 4 8 8 11
## 2 2.1 2.2 2.3
## 12 8 14 20
## 2.4 2.5 2.6 2.7
## 19 20 24 23
## 2.8 2.9 3 3.1
## 40 45 81 69
## 3.2 3.3 3.4 3.5
## 63 100 126 156
## 3.6 3.7 3.8 3.9
## 167 224 286 359
## 4 4.1 4.17324304538799 4.2
## 513 621 1463 810
## 4.3 4.4 4.5 4.6
## 897 895 848 683
## 4.7 4.8 4.9 5
## 442 221 85 271
From the table above, all ratings lie between 1 and 5.
boxplot(data_clean1$Rating,ylab = "Rating", xlab = "Count",col = "Blue")
hist(data_clean1$Rating, main="Histogram of Apps Rating after cleaning", xlab="Rating (count)", col = 'blue', breaks = 100 )
qqnorm(data_clean1$Rating)
qqline(data_clean1$Rating, col = "red")
Here, the plots are much clearer but still skewed because of the low ratings between 1 and 3. These apps may help explain why some apps are rated poorly, so they cannot be removed from our dataset.
boxplot(data_clean1$Reviews,ylab = "Reviews", xlab = "Count",col = 'Blue')
hist(data_clean1$Reviews, main="Histogram of Apps Reviews", xlab="Reviews (count)", col = 'blue', breaks = 100 )
ggplot(data_clean1, aes(x = log(Reviews))) +
geom_histogram(binwidth = 0.1, fill = "blue", color = "black") +
labs(title = "Log-Transformed Histogram of Reviews", x = "Log(Reviews)", y = "Count")
qqnorm(data_clean1$Reviews)
qqline(data_clean1$Reviews, col = "red")
As with ratings, the plots are skewed by outliers, so Reviews does not follow a normal distribution. We can therefore use the log-transformed reviews for visualisation, which compresses the scale.
xkablesummary(data_clean1)
| | App | Category | Rating | Reviews | Size | Installs | Price | Content.Rating | Genres | Last.Updated | Current.Ver | Android.Ver | log_Installs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min | Length:9659 | Length:9659 | Min. :1.000 | Min. : 0 | Length:9659 | Min. :0.000e+00 | Min. : 0.000 | Length:9659 | Length:9659 | Length:9659 | Length:9659 | Length:9659 | Min. : -Inf |
| Q1 | Class :character | Class :character | 1st Qu.:4.000 | 1st Qu.: 25 | Class :character | 1st Qu.:1.000e+03 | 1st Qu.: 0.000 | Class :character | Class :character | Class :character | Class :character | Class :character | 1st Qu.: 6.908 |
| Median | Mode :character | Mode :character | Median :4.200 | Median : 967 | Mode :character | Median :1.000e+05 | Median : 0.000 | Mode :character | Mode :character | Mode :character | Mode :character | Mode :character | Median :11.513 |
| Mean | NA | NA | Mean :4.173 | Mean : 216593 | NA | Mean :7.778e+06 | Mean : 1.099 | NA | NA | NA | NA | NA | Mean : -Inf |
| Q3 | NA | NA | 3rd Qu.:4.500 | 3rd Qu.: 29401 | NA | 3rd Qu.:1.000e+06 | 3rd Qu.: 0.000 | NA | NA | NA | NA | NA | 3rd Qu.:13.816 |
| Max | NA | NA | Max. :5.000 | Max. :78158306 | NA | Max. :1.000e+09 | Max. :400.000 | NA | NA | NA | NA | NA | Max. :20.723 |
outlierKD2(data_clean1,Reviews)
## Outliers identified: 1656
## Proportion (%) of outliers: 20.7
## Mean of the outliers: 1228141
## Mean without removing outliers: 216592.6
## Mean if we remove outliers: 7280.61
## Nothing changed
To see where the outliers lie, we split the review counts into bins and check how many apps fall into each, which shows how reviews are distributed.
We bin the reviews into custom intervals and then check the average rating for each bin.
# Define the new custom breaks for bins
# Ensure there are no NA values
# Define new breaks for more even intervals
breaks <- c(0, 100, 500, 1000, 2500, 5000, 10000, 25000,50000,100000, 300000,1000000,Inf)
# Create a categorical variable based on the new breaks
Review_Category <- cut(data_clean1$Reviews, breaks = breaks, right = FALSE,
labels = c("0+","100+", "500+", "1K+",
"2.5K+", "5K+", "10K+","25K+",
"50K+", "100K+","300K+","1M+"))
# Count the number of values in each bin
bin_counts <- as.data.frame(table(Review_Category))
# Rename the columns for clarity
colnames(bin_counts) <- c("Review_Category", "Count")
# Print the counts
print(bin_counts)
## Review_Category Count
## 1 0+ 3327
## 2 100+ 1065
## 3 500+ 462
## 4 1K+ 586
## 5 2.5K+ 475
## 6 5K+ 474
## 7 10K+ 719
## 8 25K+ 606
## 9 50K+ 498
## 10 100K+ 647
## 11 300K+ 451
## 12 1M+ 349
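With `right = FALSE`, each interval includes its lower bound and excludes its upper bound, so an app with exactly 100 reviews lands in `100+` rather than `0+`. A minimal sketch with hypothetical values and a shortened set of breaks:

```r
# Hypothetical review counts sitting on and around the bin edges
reviews <- c(0, 99, 100, 499, 500, 2000000)

# Shortened version of the breaks used above, for illustration only
breaks <- c(0, 100, 500, Inf)
bins <- cut(reviews, breaks = breaks, right = FALSE,
            labels = c("0+", "100+", "500+"))

print(data.frame(reviews, bins))
```

This choice also keeps apps with 0 reviews inside the first bin; with the default `right = TRUE` they would fall outside the lowest interval and become `NA`.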
# Create a line plot of the binned counts
ggplot(bin_counts, aes(x = Review_Category, y = Count, group = 1)) +
geom_line(color = "blue", size = 1) +
geom_point(color = "blue", size = 3) +
labs(title = "Number of Apps by Review Category",
x = "Review Category",
y = "Number of Apps") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability
Hence, few apps have very many reviews while most apps have few, which is expected.
boxplot( data_clean1$Rating~ Review_Category, data = data_clean1,
main = "Boxplot of Ratings by Review Category",
xlab = "Review Category",
ylab = "Rating",
las = 2, # Rotate the x-axis labels for readability
col = "lightblue") # Optional: Set color for the boxplots
Here we can observe that as the number of reviews increases, the median rating rises and the values cluster around higher ratings, which suggests that apps with more reviews tend to be better rated.
# Calculate the mean Rating for each Review_Category
mean_ratings <- tapply(data_clean1$Rating, Review_Category, mean, na.rm = TRUE)
# Convert the result to a data frame for better readability
mean_ratings_df <- data.frame(Review_Category = names(mean_ratings), Mean_Rating = as.numeric(mean_ratings))
# Print the mean ratings for each review bin
print(mean_ratings_df)
## Review_Category Mean_Rating
## 1 0+ 4.126221
## 2 100+ 4.029538
## 3 500+ 4.063188
## 4 1K+ 4.107030
## 5 2.5K+ 4.129572
## 6 5K+ 4.191139
## 7 10K+ 4.221836
## 8 25K+ 4.231848
## 9 50K+ 4.293775
## 10 100K+ 4.329830
## 11 300K+ 4.375610
## 12 1M+ 4.426361
# Define correct order of Review_Category as a factor
mean_ratings_df$Review_Category <- factor(mean_ratings_df$Review_Category,
levels = c("0+","100+", "500+", "1K+",
"2.5K+", "5K+", "10K+","25K+",
"50K+", "100K+", "300K+", "1M+"))
# Plot the mean ratings for each review bin in the correct order
ggplot(mean_ratings_df, aes(x = Review_Category, y = Mean_Rating)) +
geom_bar(stat = "identity", fill = "steelblue") + # Use bar plot
labs(title = "Mean Rating by Review Category",
x = "Review Category",
y = "Mean Rating") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability
As we can see, apart from the lowest bin (0+), the mean rating rises steadily with the number of reviews, from about 4.03 for 100+ to about 4.43 for 1M+.
# Create a new data frame for plotting
plot_data <- data.frame(Rating = data_clean1$Rating, Review_Category = Review_Category)
# Create a histogram of Ratings, faceted by Review_Category
ggplot(plot_data, aes(x = Rating)) +
geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
facet_wrap(~ Review_Category, labeller = label_wrap_gen()) + # Facet by Review_Category
theme_minimal() +
labs(title = "Histograms of Ratings by Review Category", x = "Rating", y = "Frequency")
This is another representation of ratings vs reviews.
The tests below check whether the review categories differ in their average rating.
anova_result <- aov(Rating ~ as.factor(Review_Category), data = data_clean1)
summary(anova_result)
## Df Sum Sq Mean Sq F value Pr(>F)
## as.factor(Review_Category) 11 106.3 9.662 41.36 <2e-16 ***
## Residuals 9647 2253.6 0.234
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value (< 2e-16) is well below 0.05, so the result is significant: the average rating is not the same across all review categories.
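With a sample this large, significance alone says little about the size of the effect. As a rough sketch, assuming eta squared is an acceptable effect-size measure here, it can be computed from the sums of squares in the ANOVA table:

```r
# Sums of squares copied from the ANOVA table above
ss_between <- 106.3
ss_within  <- 2253.6

# Eta squared: share of the total variance in Rating explained by Review_Category
eta_sq <- ss_between / (ss_between + ss_within)

print(round(eta_sq, 3))
```

A value of roughly 0.045 would mean the review category accounts for about 4.5% of the variance in ratings, so the differences between bins are real but modest.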
# Perform Tukey's HSD
tukey_result <- TukeyHSD(anova_result)
tukey_result
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = Rating ~ as.factor(Review_Category), data = data_clean1)
##
## $`as.factor(Review_Category)`
## diff lwr upr p adj
## 100+-0+ -0.096683215 -0.152307271 -0.04105916 0.0000009
## 500+-0+ -0.063032835 -0.141474646 0.01540898 0.2646281
## 1K+-0+ -0.019190832 -0.089971134 0.05158947 0.9992526
## 2.5K+-0+ 0.003350463 -0.074143085 0.08084401 1.0000000
## 5K+-0+ 0.064918154 -0.012646893 0.14248320 0.2087515
## 10K+-0+ 0.095614797 0.030638525 0.16059107 0.0000973
## 25K+-0+ 0.105627098 0.035846939 0.17540726 0.0000488
## 50K+-0+ 0.167554014 0.091642554 0.24346547 0.0000000
## 100K+-0+ 0.203608898 0.135724795 0.27149300 0.0000000
## 300K+-0+ 0.249388670 0.170111342 0.32866600 0.0000000
## 1M+-0+ 0.300139945 0.211244127 0.38903576 0.0000000
## 500+-100+ 0.033650380 -0.054364565 0.12166533 0.9848292
## 1K+-100+ 0.077492383 -0.003768703 0.15875347 0.0784345
## 2.5K+-100+ 0.100033678 0.012862795 0.18720456 0.0096675
## 5K+-100+ 0.161601369 0.074366918 0.24883582 0.0000001
## 10K+-100+ 0.192298012 0.116039053 0.26855697 0.0000000
## 25K+-100+ 0.202310313 0.121918874 0.28270175 0.0000000
## 50K+-100+ 0.264237229 0.178469737 0.35000472 0.0000000
## 100K+-100+ 0.300292113 0.221540831 0.37904339 0.0000000
## 300K+-100+ 0.346071885 0.257311491 0.43483228 0.0000000
## 1M+-100+ 0.396823160 0.299375844 0.49427048 0.0000000
## 1K+-500+ 0.043842003 -0.054455739 0.14213974 0.9515761
## 2.5K+-500+ 0.066383298 -0.036853541 0.16962014 0.6214468
## 5K+-500+ 0.127950989 0.024660470 0.23124151 0.0030189
## 10K+-500+ 0.158647632 0.064443010 0.25285225 0.0000025
## 25K+-500+ 0.168659933 0.071079887 0.26623998 0.0000011
## 50K+-500+ 0.230586849 0.128532233 0.33264146 0.0000000
## 100K+-500+ 0.266641733 0.170408442 0.36287502 0.0000000
## 300K+-500+ 0.312421505 0.207839051 0.41700396 0.0000000
## 1M+-500+ 0.363172780 0.251123410 0.47522215 0.0000000
## 2.5K+-1K+ 0.022541295 -0.075001405 0.12008400 0.9998394
## 5K+-1K+ 0.084108986 -0.013490527 0.18170850 0.1727899
## 10K+-1K+ 0.114805629 0.026878134 0.20273312 0.0012014
## 25K+-1K+ 0.124817930 0.033283243 0.21635262 0.0005180
## 50K+-1K+ 0.186744846 0.090454254 0.28303544 0.0000000
## 100K+-1K+ 0.222799730 0.132702117 0.31289734 0.0000000
## 300K+-1K+ 0.268579502 0.169613735 0.36754527 0.0000000
## 1M+-1K+ 0.319330777 0.212504774 0.42615678 0.0000000
## 5K+-2.5K+ 0.061567691 -0.041004546 0.16413993 0.7193424
## 10K+-2.5K+ 0.092264334 -0.001152170 0.18568084 0.0565429
## 25K+-2.5K+ 0.102276635 0.005457227 0.19909604 0.0276896
## 50K+-2.5K+ 0.164203551 0.062875978 0.26553112 0.0000078
## 100K+-2.5K+ 0.200258435 0.104796512 0.29572036 0.0000000
## 300K+-2.5K+ 0.246038206 0.142165102 0.34991131 0.0000000
## 1M+-2.5K+ 0.296789482 0.185401898 0.40817707 0.0000000
## 10K+-5K+ 0.030696643 -0.062779181 0.12417247 0.9957463
## 25K+-5K+ 0.040708944 -0.056167701 0.13758559 0.9685508
## 50K+-5K+ 0.102635860 0.001253596 0.20401812 0.0440982
## 100K+-5K+ 0.138690744 0.043170771 0.23421072 0.0001331
## 300K+-5K+ 0.184470516 0.080544059 0.28839697 0.0000004
## 1M+-5K+ 0.235221791 0.123784453 0.34665913 0.0000000
## 25K+-10K+ 0.010012302 -0.077112114 0.09713672 0.9999999
## 50K+-10K+ 0.071939217 -0.020169104 0.16404754 0.3070668
## 100K+-10K+ 0.107994101 0.022380758 0.19360745 0.0022235
## 300K+-10K+ 0.153773873 0.058872409 0.24867534 0.0000078
## 1M+-10K+ 0.204525148 0.101453039 0.30759726 0.0000000
## 50K+-25K+ 0.061926916 -0.033630908 0.15748474 0.6094814
## 100K+-25K+ 0.097981800 0.008667751 0.18729585 0.0175649
## 300K+-25K+ 0.143761571 0.045508620 0.24201452 0.0001113
## 1M+-25K+ 0.194512847 0.088346871 0.30067882 0.0000001
## 100K+-50K+ 0.036054884 -0.058127272 0.13023704 0.9846717
## 300K+-50K+ 0.081834656 -0.020863551 0.18453286 0.2768896
## 1M+-50K+ 0.132585931 0.022293168 0.24287869 0.0048805
## 300K+-100K+ 0.045779772 -0.051135776 0.14269532 0.9282456
## 1M+-100K+ 0.096531047 -0.008398431 0.20146052 0.1064662
## 1M+-300K+ 0.050751275 -0.061884591 0.16338714 0.9479902
# Convert the result to a data frame
tukey_df <- as.data.frame(tukey_result$`as.factor(Review_Category)`)
# Filter for significant p-values
significant_tukey <- tukey_df[tukey_df[["p adj"]] < 0.05, ] # Select by column name instead of position
# Display the significant results
print(significant_tukey)
## diff lwr upr p adj
## 100+-0+ -0.09668322 -0.152307271 -0.04105916 8.987756e-07
## 10K+-0+ 0.09561480 0.030638525 0.16059107 9.732720e-05
## 25K+-0+ 0.10562710 0.035846939 0.17540726 4.884843e-05
## 50K+-0+ 0.16755401 0.091642554 0.24346547 0.000000e+00
## 100K+-0+ 0.20360890 0.135724795 0.27149300 0.000000e+00
## 300K+-0+ 0.24938867 0.170111342 0.32866600 0.000000e+00
## 1M+-0+ 0.30013994 0.211244127 0.38903576 0.000000e+00
## 2.5K+-100+ 0.10003368 0.012862795 0.18720456 9.667490e-03
## 5K+-100+ 0.16160137 0.074366918 0.24883582 9.538328e-08
## 10K+-100+ 0.19229801 0.116039053 0.26855697 0.000000e+00
## 25K+-100+ 0.20231031 0.121918874 0.28270175 0.000000e+00
## 50K+-100+ 0.26423723 0.178469737 0.35000472 0.000000e+00
## 100K+-100+ 0.30029211 0.221540831 0.37904339 0.000000e+00
## 300K+-100+ 0.34607188 0.257311491 0.43483228 0.000000e+00
## 1M+-100+ 0.39682316 0.299375844 0.49427048 0.000000e+00
## 5K+-500+ 0.12795099 0.024660470 0.23124151 3.018884e-03
## 10K+-500+ 0.15864763 0.064443010 0.25285225 2.473396e-06
## 25K+-500+ 0.16865993 0.071079887 0.26623998 1.080775e-06
## 50K+-500+ 0.23058685 0.128532233 0.33264146 0.000000e+00
## 100K+-500+ 0.26664173 0.170408442 0.36287502 0.000000e+00
## 300K+-500+ 0.31242150 0.207839051 0.41700396 0.000000e+00
## 1M+-500+ 0.36317278 0.251123410 0.47522215 0.000000e+00
## 10K+-1K+ 0.11480563 0.026878134 0.20273312 1.201416e-03
## 25K+-1K+ 0.12481793 0.033283243 0.21635262 5.179950e-04
## 50K+-1K+ 0.18674485 0.090454254 0.28303544 1.572425e-08
## 100K+-1K+ 0.22279973 0.132702117 0.31289734 0.000000e+00
## 300K+-1K+ 0.26857950 0.169613735 0.36754527 0.000000e+00
## 1M+-1K+ 0.31933078 0.212504774 0.42615678 0.000000e+00
## 25K+-2.5K+ 0.10227664 0.005457227 0.19909604 2.768961e-02
## 50K+-2.5K+ 0.16420355 0.062875978 0.26553112 7.808701e-06
## 100K+-2.5K+ 0.20025843 0.104796512 0.29572036 3.507881e-10
## 300K+-2.5K+ 0.24603821 0.142165102 0.34991131 0.000000e+00
## 1M+-2.5K+ 0.29678948 0.185401898 0.40817707 0.000000e+00
## 50K+-5K+ 0.10263586 0.001253596 0.20401812 4.409823e-02
## 100K+-5K+ 0.13869074 0.043170771 0.23421072 1.331239e-04
## 300K+-5K+ 0.18447052 0.080544059 0.28839697 4.428778e-07
## 1M+-5K+ 0.23522179 0.123784453 0.34665913 2.244944e-10
## 100K+-10K+ 0.10799410 0.022380758 0.19360745 2.223466e-03
## 300K+-10K+ 0.15377387 0.058872409 0.24867534 7.832139e-06
## 1M+-10K+ 0.20452515 0.101453039 0.30759726 5.942656e-09
## 100K+-25K+ 0.09798180 0.008667751 0.18729585 1.756493e-02
## 300K+-25K+ 0.14376157 0.045508620 0.24201452 1.113055e-04
## 1M+-25K+ 0.19451285 0.088346871 0.30067882 1.436204e-07
## 1M+-50K+ 0.13258593 0.022293168 0.24287869 4.880458e-03
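As the filtered table shows, many pairs of review categories differ significantly in mean rating, while neighbouring categories (for example 1K+ vs 500+) generally do not. The largest gaps are between the lowest review categories and 1M+, as expected: 1M+ apps are rated about 0.30 points higher than 0+ apps and about 0.40 points higher than 100+ apps.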
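The filtering step can be illustrated on toy data. The sketch below is not the report's data: the three review categories and their ratings are simulated, and the object names (`toy`, `tukey_toy_df`, `significant_toy`) are made up for illustration. It shows that the `TukeyHSD` result holds one matrix per factor term with columns `diff`, `lwr`, `upr`, and `p adj`, so significant pairs can be selected by column name.

```r
# Simulated ratings for three hypothetical review categories
set.seed(1)
toy <- data.frame(
  Review_Category = rep(c("0+", "1K+", "1M+"), each = 50),
  Rating = c(rnorm(50, 4.00, 0.3), rnorm(50, 4.05, 0.3), rnorm(50, 4.40, 0.3))
)

# One-way ANOVA followed by Tukey's honest significant difference test
fit <- aov(Rating ~ as.factor(Review_Category), data = toy)
tukey_toy <- TukeyHSD(fit)

# The first (and only) list element is the matrix of pairwise comparisons
tukey_toy_df <- as.data.frame(tukey_toy[[1]])

# Keep only the pairs whose adjusted p-value is below 0.05
significant_toy <- tukey_toy_df[tukey_toy_df[["p adj"]] < 0.05, ]
print(significant_toy)
```

Selecting by name avoids depending on the column order of the result.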
To make it easier to compare Ratings and Reviews against Installs, we can treat the Installs values as ordered categories.
# Step 1: Identify the unique values in the 'Installs' column
unique_values <- unique(data_clean1$Installs)
# Function to convert installs to numeric
convert_to_numeric <- function(x) {
# Remove non-numeric characters (commas, "+") and convert to numeric.
# No extra scaling is needed: "1,000,000+" already becomes 1000000.
as.numeric(gsub("[^0-9]", "", x))
}
# Sort unique values based on the custom numeric conversion
sorted_values <- unique_values[order(sapply(unique_values, convert_to_numeric))]
# Create a bar plot with the ordered factor without adding a new column
ggplot(data = data_clean1, aes(x = factor(Installs, levels = sorted_values))) +
geom_bar(fill = "blue", alpha = 0.7) +
xlab("Installs") +
ylab("Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) + # Rotate x-axis labels for readability
ggtitle("Distribution of App Installs")
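The ordering logic above can be checked on a handful of labels. This is a minimal sketch on made-up install labels, not the report's data; `to_number` is a hypothetical helper that strips everything except digits.

```r
# A few install labels in arbitrary order (illustrative values)
installs <- c("10,000+", "500+", "1,000,000+", "100+", "5,000+", "50,000,000+")

# Strip commas and the trailing "+" and convert to a number
to_number <- function(x) as.numeric(gsub("[^0-9]", "", x))

# Order the unique labels by their numeric value
unique_labels <- unique(installs)
ordered_levels <- unique_labels[order(to_number(unique_labels))]

# A factor with these levels plots in ascending install order
installs_factor <- factor(installs, levels = ordered_levels)
print(levels(installs_factor))
```

Because the levels carry the order, the same factor can be reused in every plot without adding a numeric column.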
Next, we compute the average rating for each install category and look at how the two are related.
# Function to convert installs to numeric
convert_to_numeric <- function(x) {
# Strip commas and "+"; no extra scaling is needed
as.numeric(gsub("[^0-9]", "", x))
}
# Step 1: Calculate mean ratings and counts for each install category using dplyr
data_mean <- data_clean1 %>%
group_by(Installs) %>%
summarise(Mean_Rating = mean(Rating, na.rm = TRUE), Count = n()) %>%
ungroup()
# Sort install categories
sorted_installs <- data_mean$Installs[order(sapply(data_mean$Installs, convert_to_numeric))]
# Create dot plot with point size based on the number of apps in each install category
ggplot(data_mean, aes(x = factor(Installs, levels = sorted_installs), y = Mean_Rating)) +
geom_point(aes(size = Count), color = "blue", alpha = 0.7) + # Count = n() is the number of apps
geom_segment(aes(x = factor(Installs, levels = sorted_installs),
xend = factor(Installs, levels = sorted_installs),
y = 0, yend = Mean_Rating), color = "grey", linetype = "dashed") +
labs(title = "Mean Ratings by Install Category", x = "Install Categories", y = "Mean Ratings") +
scale_size_continuous(name = "Number of Apps") + # Add legend for size
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # Rotate x-axis labels for readability
The plot shows that mean ratings are high for both low and high install counts. However, apps that combine a large number of installs with high ratings are generally regarded as superior. The size of each dot reflects the number of apps in that install category, so the means shown by larger dots are averages over more apps.
Now, we are done with ratings and reviews as well.
length(unique(data_clean$Category))
## [1] 33
length(unique(data_clean$Genres))
## [1] 118
There are 33 categories in the data frame and 118 genres, so Genres is a finer-grained breakdown than Category. For that reason, the later analyses in this project use the Category variable.
Below is the graph for the distribution of Categories for the dataset after removing duplicates.
#Distribution for Category
category_counts <- table(data_clean$Category)
# Convert to data frame for plotting
category_counts_df <- as.data.frame(category_counts)
colnames(category_counts_df) <- c("Category", "Frequency")
ggplot(category_counts_df, aes(x = reorder(Category, Frequency), y = Frequency)) +
geom_bar(stat = "identity", fill = "#1f3374") +
geom_text(aes(label = Frequency), vjust = 0.5, hjust=1, size=2.5, color='#f8c220') +
coord_flip() +
labs(title = "Distribution of Categories", x = "Category", y = "Frequency") +
theme_minimal() +
theme(
plot.background = element_rect(fill = "#efefef", color = NA),
panel.background = element_rect(fill = "#efefef", color = NA),
axis.text.y = element_text(size = 5.5)
)
Below is a boxplot showing the distribution of the number of installs for each category.
ggplot(data_clean, aes(x = reorder(Category, log_Installs), y = log_Installs)) +
geom_boxplot(outlier.color = "#f05555", outlier.shape = 1, color='#1f3374', fill="#efefef") + # Red outliers for emphasis
coord_flip() + # Flip for better readability
# log_Installs is already log-transformed, so no additional log scale is applied
theme_minimal() +
labs(title = "Distribution of Installs by Category",
x = "Category",
y = "log(Installs)") +
theme(
plot.background = element_rect(fill = "#efefef", color = NA),
panel.background = element_rect(fill = "#efefef", color = NA),
axis.text.y = element_text(size = 5.5)
)
convert_size <- function(size) {
size <- gsub(",", "", size) # Remove commas
size <- tolower(size) # Make lowercase for consistency
# Handle "varies with device" by assigning NA
if (size == "varies with device") return(NA)
# Convert "k" to MB (1 MB = 1000 KB)
if (grepl("k", size)) return(as.numeric(gsub("k", "", size)) / 1000)
# Convert "M" to numeric MB
if (grepl("m", size)) return(as.numeric(gsub("m", "", size)))
# Handle numeric values directly (e.g., "1000+")
if (grepl("\\d+\\+", size)) return(as.numeric(gsub("\\+", "", size)) / 1000)
# Default case: return as numeric if possible
return(as.numeric(size))
}
df_clean <- data_clean %>%
mutate(Size = sapply(Size, convert_size)) %>%
filter(!is.na(Size))
# Plot the histogram with faceting by category
ggplot(df_clean, aes(x = Size)) +
geom_histogram(binwidth = 5, fill = "#304ba6", color = "black") +
facet_wrap(~ Category, scales = "free_y") +
theme_minimal() +
labs(
title = "Distribution of App Sizes by Category",
x = "Size (MB)",
y = "Count"
) +
theme(
strip.text = element_text(size = 8),
axis.text.x = element_text(size = 7, angle = 45, hjust = 1)
)
ggplot(df_clean, aes(x = reorder(Category, Size, FUN = median), y = Size)) +
geom_boxplot(outlier.color = "#f05555", outlier.shape = 1) +
coord_flip() +
theme_minimal() +
labs(
title = "Boxplot of App Sizes by Category (Ordered by Median)",
x = "Category",
y = "Size (MB)"
) +
theme(
strip.text = element_text(size = 8),
axis.text.x = element_text(size = 7, angle = 45, hjust = 1)
)
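The size conversion used for these plots can also be written as a vectorized function, which avoids calling `sapply` row by row. The sketch below is an alternative under stated assumptions, not the report's `convert_size`: the helper name `size_to_mb` is made up, it keeps the report's convention of 1 MB = 1000 KB, and it returns `NA` for any label that does not end in `k` or `M`.

```r
# Vectorized conversion of size labels such as "19M" or "201k" to megabytes
size_to_mb <- function(size) {
  s <- tolower(gsub(",", "", size))          # drop commas, normalise case
  value <- as.numeric(gsub("[^0-9.]", "", s)) # keep digits and the decimal point
  ifelse(grepl("k$", s), value / 1000,        # kilobytes to megabytes
         ifelse(grepl("m$", s), value,        # already in megabytes
                NA_real_))                    # anything else is treated as missing
}

sizes_mb <- size_to_mb(c("19M", "8.7M", "201k", "Varies with device"))
print(sizes_mb)
```

Labels like "Varies with device" become `NA` and can then be dropped with `filter(!is.na(Size))`, as in the code above.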
df_aggregated <- data_clean %>%
group_by(Category) %>%
summarise(Total_Reviews = sum(Reviews, na.rm = TRUE))
# Plot the total reviews by category using a bar chart
ggplot(df_aggregated, aes(x = reorder(Category, -Total_Reviews), y = log10(Total_Reviews))) +
geom_bar(stat = "identity", fill = "#1f3374") +
labs(
title = "Log-Scaled Total Reviews by Category",
x = "Category",
y = "Log10(Total Number of Reviews)"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(df_clean, aes(x = Rating)) +
geom_histogram(binwidth = 0.5, fill = "#1f3374", color='#efefef') +
facet_wrap(~ Category, scales = "free_y") + # Facet by Category with independent y-axis
scale_x_continuous(limits = c(1, 5), breaks = seq(1, 5, by = 0.5)) + # Restrict x-axis to 1-5
theme_minimal() +
labs(
title = "Distribution of Ratings by Category",
x = "Rating",
y = "Count"
) +
theme(
strip.text = element_text(size = 5), # Adjust facet label size
axis.text.x = element_text(size = 5, angle = 45, hjust = 1), # Rotate x-axis labels
plot.title = element_text(hjust = 0.5) # Center the plot title
)
Due to the inconsistent formatting of values in the
Current.Ver column, this column is dropped and will be
excluded from the analysis.
Below is the figure showing the distribution of Android versions.
extract_version <- function(version) {
version <- tolower(version) # Make lowercase for consistency
# Handle "Varies with device" and "NaN"
if (version == "varies with device" || version == "nan") return(NA)
# Extract the first version in case of ranges (e.g., "4.1 - 7.1.1" -> "4.1");
# this split also drops a trailing "and up" (e.g., "4.0 and up" -> "4.0")
first_version <- strsplit(version, "[- ]")[[1]][1]
# Note: three-part versions such as "4.0.3" are not valid numbers, so
# as.numeric() returns NA for them and those rows are filtered out below
return(as.numeric(first_version)) # Convert to numeric
}
data_clean <- data_clean %>%
mutate(Android_Ver = sapply(Android.Ver, extract_version)) %>%
filter(!is.na(Android_Ver)) # Remove rows with NA in Android_Ver
android_installs <- data_clean %>%
group_by(Android.Ver) %>%
summarize(Total_Installs = sum(Installs, na.rm = TRUE))
ggplot(data_clean, aes(x = Android_Ver)) +
geom_histogram(binwidth = 0.5, fill = "#1f3374", color='#efefef') +
scale_x_continuous(breaks = seq(1, 8, by = 1.0)) + # Set x-axis ticks from 1.0 to 8.0
theme_minimal() +
labs(
title = "Distribution of Android Versions",
x = "Android Version",
y = "Count"
) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
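One limitation of converting version labels with `as.numeric` is that three-part versions such as "4.0.3" are not valid numbers. The sketch below shows one possible workaround, which is an assumption and not what the report does: a hypothetical helper `major_minor` keeps only the leading major.minor part of each label. The example labels are illustrative.

```r
# Example labels, including a three-part version and a missing value
labels <- c("4.0.3 and up", "4.1 - 7.1.1", "Varies with device", "2.3 and up", NA)

# Keep the leading "major.minor" part of each label and convert it to a number
major_minor <- function(version) {
  pattern <- "^([0-9]+(\\.[0-9]+)?).*$"
  hit <- grepl(pattern, version)             # FALSE for NA and for non-numeric labels
  out <- rep(NA_real_, length(version))
  out[hit] <- as.numeric(sub(pattern, "\\1", version[hit]))
  out
}

versions <- major_minor(labels)
print(versions)
```

With this approach "4.0.3 and up" is counted as 4.0 instead of being dropped, at the cost of ignoring the patch level.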
# Use the per-version totals computed in android_installs
ggplot(android_installs, aes(x = reorder(Android.Ver, Total_Installs), y = Total_Installs)) +
geom_col(fill = "#1f3374") +
coord_flip() + # Flip coordinates for better readability
scale_y_continuous(labels = scales::comma) + # Format y-axis with commas
theme_minimal() +
labs(
title = "Total Installs by Android Version",
x = "Android Version",
y = "Total Installs"
) +
theme(
axis.text.y = element_text(size = 8), # Adjust y-axis text size
plot.title = element_text(hjust = 0.5) # Center the plot title
)
data_clean <- data_clean %>%
filter(!is.na(Android.Ver) & !is.na(Reviews)) %>%
mutate(Scaled_Reviews = log10(Reviews + 1))
ggplot(data_clean, aes(x = reorder(Android.Ver, Scaled_Reviews, FUN = median), y = Scaled_Reviews)) +
geom_boxplot(outlier.color = "#f05555", outlier.shape = 1) + # Boxplot with red outliers
coord_flip() + # Flip coordinates for better readability
theme_minimal() +
labs(
title = "Distribution of Scaled Reviews by Android Version",
x = "Android Version",
y = "Scaled Reviews (Log10)"
) +
theme(
axis.text.y = element_text(size = 8), # Adjust y-axis text size
plot.title = element_text(hjust = 0.5) # Center the plot title
)
ggplot(df_clean, aes(x = Rating, fill = Android.Ver)) +
geom_histogram(binwidth = 0.5, position = "stack", color = "black", alpha = 0.7) +
scale_x_continuous(breaks = seq(1, 5, by = 0.5)) + # Set x-axis breaks
theme_minimal() +
labs(
title = "Histogram of Ratings by Android Version",
x = "Rating",
y = "Count"
) +
theme(
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8),
plot.title = element_text(hjust = 0.5) # Center the plot title
)
data_final <- data_clean %>% select(-c('Genres', 'Current.Ver'))
str(data_final)
## 'data.frame': 6979 obs. of 13 variables:
## $ App : chr "Sketch - Draw & Paint" "Pixel Draw - Number Art Coloring Book" "Paper flowers instructions" "Infinite Painter" ...
## $ Category : chr "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" "ART_AND_DESIGN" ...
## $ Rating : num 4.5 4.3 4.4 4.1 4.4 4.4 4.4 4.2 4.6 4.4 ...
## $ Reviews : num 215644 967 167 36815 13791 ...
## $ Size : chr "25M" "2.8M" "5.6M" "29M" ...
## $ Installs : num 5e+07 1e+05 5e+04 1e+06 1e+06 1e+06 1e+06 1e+07 1e+05 1e+05 ...
## $ Price : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Content.Rating: chr "Teen" "Everyone" "Everyone" "Everyone" ...
## $ Last.Updated : chr "June 8, 2018" "June 20, 2018" "March 26, 2017" "June 14, 2018" ...
## $ Android.Ver : chr "4.2 and up" "4.4 and up" "2.3 and up" "4.2 and up" ...
## $ log_Installs : num 17.7 11.5 10.8 13.8 13.8 ...
## $ Android_Ver : Named num 4.2 4.4 2.3 4.2 3 4.1 4 4.1 4.4 2.3 ...
## ..- attr(*, "names")= chr [1:6979] "4.2 and up" "4.4 and up" "2.3 and up" "4.2 and up" ...
## $ Scaled_Reviews: num 5.33 2.99 2.23 4.57 4.14 ...
Next, we carry out bivariate analysis of price against the other variables.
library(ggplot2)
#Plotting a scatter plot between Price and installs
ggplot(data_final, aes(x=Price, y=log_Installs)) +
geom_point(color = 'red', size = 1, alpha = 0.5) +
geom_smooth(method = 'lm', color = 'blue', se = FALSE) +
labs(title = "Price vs Installs", x = "Price (USD)", y = "log(Installs)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
From the scatter plot, we can see that apps with a price of 0 have the
highest numbers of installs.
# Categorize the apps as "Free" or "Paid" based on Price
Price_Category <- ifelse(data_final$Price == 0, "Free", "Paid")
str(data_final$Price)
## num [1:6979] 0 0 0 0 0 0 0 0 0 0 ...
str(Price_Category)
## chr [1:6979] "Free" "Free" "Free" "Free" "Free" "Free" "Free" "Free" ...
str(data_final$log_Installs)
## num [1:6979] 17.7 11.5 10.8 13.8 13.8 ...
For a clearer visualization, we label apps with a price of 0 as "Free" and all others as "Paid", and draw a box plot.
# Box plot of Price Category vs. log-transformed Installs
ggplot(data_final, aes(x = Price_Category, y = log_Installs)) +
geom_boxplot(fill = "lightblue", color = "darkblue", alpha = 0.6) +
labs(title = "Price Categories vs. Log-Transformed Installs",
x = "Price Category",
y = "Log(Installs)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
“Free” apps tend to have more installs than “Paid” apps. The difference between the means on the log scale is estimated to be between 3.47 and 3.97.
table(Price_Category)
## Price_Category
## Free Paid
## 6406 573
# Check for missing values and ensure no negative/zero values in log_Installs
#data_final <- data_final %>%
#filter(!is.na(Installs), Installs > 0) # Remove missing values and zeros in Installs
# Apply log transformation, adding 1 to avoid log(0)
#data_final$log_Installs <- log(data_final$Installs + 1)
# Ensure Price_Category has no missing values
#data_final <- data_final %>%
#filter(!is.na(Price_Category))
#Perform t-test on log-transformed Installs by Price Category
#t_test_result <- t.test(log_Installs ~ Price_Category, data = data_final, var.equal = FALSE)
#Print t-test results
#print(t_test_result)
There is a statistically significant difference between the number of installs for “Free” and “Paid” apps, with the p-value being extremely small.
In practical terms, free apps are more popular than paid apps, which matches what is generally observed in the app market.
# Add Price_Category to data_final
data_duplicate <- data_final
data_duplicate$Price_Category <- ifelse(data_final$Price == 0, "Free", "Paid")
# Create a summarized table for Price_Category and log_Installs
summary_table <- data_duplicate %>%
group_by(Price_Category) %>%
summarise(Average_Log_Installs = mean(log_Installs, na.rm = TRUE),
Count = n())
# View the summarized table
kable(summary_table, format = "html", col.names = c("Price Category", "Mean Log(Installs)", "App Count")) %>%
kable_styling(full_width = FALSE, position = "center")
| Price Category | Mean Log(Installs) | App Count |
|---|---|---|
| Free | -Inf | 6406 |
| Paid | -Inf | 573 |
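A mean of -Inf suggests that at least one app in each group has a log_Installs value of -Inf, which is what `log(0)` produces for an app with zero installs. The sketch below uses made-up install counts to show the effect and two common ways around it: `log1p`, which computes log(1 + x), or restricting the mean to apps with at least one install. Which option suits the analysis is a judgment call.

```r
# Illustrative install counts, including one app with zero installs
installs <- c(0, 10, 1000, 1e6)

# log(0) is -Inf, and a single -Inf makes the mean -Inf
mean_raw <- mean(log(installs))

# Option 1: log(1 + x) is finite at zero
mean_log1p <- mean(log1p(installs))

# Option 2: average only over apps with at least one install
mean_positive <- mean(log(installs[installs > 0]))

print(c(raw = mean_raw, log1p = mean_log1p, positive_only = mean_positive))
```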
# Plot Price vs. Reviews
ggplot(data_final, aes(x=Price, y=Reviews)) +
geom_point(color = 'blue') +
geom_smooth(method = 'lm', color = 'red', se = FALSE) +
labs(title = "Price vs Reviews", x = "Price (USD)", y = "Number of Reviews") +
theme_minimal() +
theme(
panel.background = element_rect(fill = "white"), # Set panel background to white
plot.background = element_rect(fill = "white"), # Set plot background to white
axis.text.x = element_text(angle = 45, hjust = 1)
)
# Plot Price vs. Rating
ggplot(data_final, aes(x=Price, y=Rating)) +
geom_point(color = 'green') +
geom_smooth(method = 'lm', color = 'red', se = FALSE) +
labs(title = "Price vs Rating", x = "Price (USD)", y = "Rating") +
theme_minimal() +
theme(
panel.background = element_rect(fill = "white"), # Set panel background to white
plot.background = element_rect(fill = "white"), # Set plot background to white
axis.text.x = element_text(angle = 45, hjust = 1)
)
Price vs Reviews with installation: Cheaper apps tend to have more reviews, indicating higher popularity or more frequent installs. In contrast, expensive apps tend to have fewer reviews, possibly because fewer people buy higher-priced apps.
Price vs Ratings with installation: Price does not strongly affect the average rating, but there is a slight trend where lower-priced apps show more variation in ratings, while higher-priced apps tend to receive more consistent ratings around 4. Higher-priced apps may be meeting customer expectations.
#Confirming with a t-test
# Perform t-test for Reviews between Free and Paid
t_test_reviews <- t.test(Reviews ~ Price_Category, data = data_final)
# Perform t-test for Rating between Free and Paid
t_test_rating <- t.test(Rating ~ Price_Category, data = data_final)
# Print the results
print(t_test_reviews)
##
## Welch Two Sample t-test
##
## data: Reviews by Price_Category
## t = 10.408, df = 6492.9, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Free and group Paid is not equal to 0
## 95 percent confidence interval:
## 109599.3 160464.8
## sample estimates:
## mean in group Free mean in group Paid
## 140191.978 5159.899
print(t_test_rating)
##
## Welch Two Sample t-test
##
## data: Rating by Price_Category
## t = -4.2088, df = 530.31, p-value = 3.015e-05
## alternative hypothesis: true difference in means between group Free and group Paid is not equal to 0
## 95 percent confidence interval:
## -0.16392825 -0.05959785
## sample estimates:
## mean in group Free mean in group Paid
## 4.144837 4.256600
There is a statistically significant difference between the mean number of reviews for Free and Paid apps. Free apps have far more reviews on average (about 140,192 versus 5,160 for Paid apps).
There is also a statistically significant difference between the mean ratings for Free and Paid apps. Paid apps have slightly higher ratings on average (4.26 versus 4.14), though the difference is small: the 95% confidence interval puts it between about 0.06 and 0.16 rating points.
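Review counts are typically very right-skewed, so it can be worth checking that the conclusion does not hinge on a few very large apps. The sketch below is a robustness check on simulated data, not a result from this dataset: it runs a Welch t-test on log10(Reviews + 1) and a Wilcoxon rank-sum test. The group sizes and distribution parameters are made up.

```r
# Simulated, heavily skewed review counts for two price categories
set.seed(42)
toy <- data.frame(
  Price_Category = rep(c("Free", "Paid"), times = c(300, 60)),
  Reviews = round(c(rlnorm(300, meanlog = 8, sdlog = 2.5),
                    rlnorm(60, meanlog = 6, sdlog = 2.5)))
)

# Welch t-test on the log scale, where the data are far less skewed
welch_log <- t.test(log10(Reviews + 1) ~ Price_Category, data = toy)

# Rank-based alternative that makes no normality assumption
rank_test <- wilcox.test(Reviews ~ Price_Category, data = toy)

print(c(welch_log = welch_log$p.value, wilcoxon = rank_test$p.value))
```

If all three tests agree, the difference between groups is unlikely to be driven by outliers alone.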
# Check correlation between variables
correlation_matrix <- data_final %>%
select(Price, Reviews, Rating, log_Installs) %>%
cor(use = "complete.obs")
correlation_matrix
## Price Reviews Rating log_Installs
## Price 1.000000000 -0.009041121 -0.02246442 -0.05681441
## Reviews -0.009041121 1.000000000 0.06964788 0.24935574
## Rating -0.022464425 0.069647877 1.00000000 0.06464616
## log_Installs -0.056814407 0.249355741 0.06464616 1.00000000
# Create a colorful correlation matrix
corrplot(correlation_matrix, method = "color",
col = colorRampPalette(c("red", "white", "blue"))(200),
type = "upper",
tl.col = "black", tl.srt = 45,
addCoef.col = "black", # Add correlation coefficients
number.cex = 0.7, # Adjust size of numbers
title = "Correlation Matrix", # Title
mar = c(0, 0, 1, 0)) # Margins
Price vs. Log_Installs: -0.06, suggesting a very weak negative relationship between price and the number of installs.
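Pearson correlation measures linear association and can be weak when most prices are exactly 0. A rank-based Spearman correlation is a simple complementary check. The sketch below uses simulated prices and installs, chosen only to mimic a dataset dominated by free apps. It illustrates the two calls and makes no claim about this dataset.

```r
# Simulated data: 180 free apps and 20 paid apps with unrelated install counts
set.seed(7)
price <- c(rep(0, 180), runif(20, min = 0.99, max = 30))
installs <- round(10^runif(200, min = 1, max = 7))

# Linear association on the log scale
pearson_r <- cor(price, log(installs), method = "pearson")

# Monotonic (rank-based) association, less sensitive to extreme prices
spearman_r <- cor(price, log(installs), method = "spearman")

print(c(pearson = pearson_r, spearman = spearman_r))
```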
# Scatter plot of Price vs. Ratings with log_Installs as color
ggplot(data_final, aes(x = Price, y = Rating,color = log_Installs)) +
geom_point(alpha = 0.6) +
scale_color_gradient(low = "blue", high = "red") +
labs(title = "Price vs. Ratings, Colored by log(Installs)",
x = "Price",
y = "Rating",
color = "log(Installs)") +
theme_minimal()
# Scatter plot of Price vs. Reviews with log_Installs as color
ggplot(data_final, aes(x = Price, y = Reviews,color = log_Installs)) +
geom_point(alpha = 0.6) +
scale_color_gradient(low = "darkgreen", high = "yellow") +
labs(title = "Price vs. Reviews, Colored by log(Installs)",
x = "Price",
y = "Reviews",
color = "log(Installs)") +
theme_minimal()
In conclusion, apps with lower prices have more ratings and installs, while higher-priced apps tend to have fewer installs and more scattered ratings. The same pattern holds for reviews.
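# Plot Price vs Size
# Size in data_final is still a character column (e.g. "25M"), so use df_clean,
# where Size has been converted to numeric MB
ggplot(df_clean, aes(x=Price, y=Size)) +
geom_point(color = 'red') +
geom_smooth(method = 'lm', color = 'blue', se = FALSE) +
labs(title = "Price vs Size", x = "Price (USD)", y = "App Size (MB)") +
theme_minimal()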
# Remove leading and trailing spaces and convert all text to a consistent format
data_final$Content.Rating <- trimws(tolower(data_final$Content.Rating))
cr_missing <- sum(is.na(data_final$Content.Rating)) # The column is named Content.Rating
print(paste("Number of missing values in 'Content Rating':", cr_missing))
## [1] "Number of missing values in 'Content Rating': 0"
There are no missing values for Content rating.
# Convert Last Updated to Date format
data_final$Last.Updated <- as.Date(data_final$Last.Updated, format = "%B %d, %Y")
# Verify the cleaning
print("\nSummary of Last.Updated after cleaning:")
## [1] "\nSummary of Last.Updated after cleaning:"
print(summary(data_final$Last.Updated))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2010-05-21" "2017-07-12" "2018-04-20" "2017-10-11" "2018-07-13" "2018-08-08"
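A quick way to confirm that a date conversion worked is to check the class of the result and look for values that failed to parse. The sketch below uses three made-up date strings in the same "Month day, Year" layout. Note that `%B` matches month names in the current locale, so it assumes an English locale.

```r
# Date strings in the same layout as the Last.Updated column (illustrative values)
raw_dates <- c("June 8, 2018", "March 26, 2017", "January 7, 2018")

# %B = full month name, %d = day of month, %Y = four-digit year
parsed <- as.Date(raw_dates, format = "%B %d, %Y")

# A successful conversion has class "Date" and no NA values
print(class(parsed))
print(sum(is.na(parsed)))
print(range(parsed))
```

Strings that do not match the format become `NA`, so counting `NA` values after the conversion reveals any malformed dates.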
# 1. Content Rating Distribution
content_rating_dist <- table(data_final$Content.Rating)
print("Content Rating Distribution:")
## [1] "Content Rating Distribution:"
print(content_rating_dist)
##
## adults only 18+ everyone everyone 10+ mature 17+ teen
## 2 5777 221 264 714
## unrated
## 1
# Bar plot for Content Rating
ggplot(data_final, aes(x = Content.Rating)) +
geom_bar(fill = "skyblue") +
geom_text(stat = "count", aes(label = ..count..), vjust = -0.5) +
labs(title = "Distribution of App Content Ratings",
x = "Content Rating",
y = "Number of Apps") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
The bar chart shows how many apps fall under each content rating, with
one bar per rating. The "everyone" rating dominates with 5,777 of the
6,979 apps, while "adults only 18+" and "unrated" together account for only three apps.
# Last Updated Analysis
# Create summary of updates by month and year
updates_by_month <- data_final %>%
mutate(
update_month = format(Last.Updated, "%Y-%m"),
update_year = year(Last.Updated)
) %>%
group_by(update_month) %>%
summarize(count = n()) %>%
arrange(update_month)
# Plot updates over time
ggplot(updates_by_month, aes(x = as.Date(paste0(update_month, "-01")), y = count)) +
geom_line(color = "blue") +
geom_point(color = "red") +
labs(title = "Number of App Updates Over Time",
x = "Date",
y = "Number of Updates") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
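The number of updates has increased drastically since the end of 2017. Because Last.Updated records only the most recent update of each app, this mainly shows that most apps were last updated shortly before the data was collected.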
# Content Rating and Update Frequency Relationship
update_frequency_by_rating <- data_final %>%
group_by(Content.Rating) %>%
summarize(
avg_last_update = mean(Last.Updated),
median_last_update = median(Last.Updated),
n_apps = n()
)
print("\nUpdate Frequency by Content Rating:")
## [1] "\nUpdate Frequency by Content Rating:"
print(update_frequency_by_rating)
## # A tibble: 6 × 4
## Content.Rating avg_last_update median_last_update n_apps
## <chr> <date> <date> <int>
## 1 adults only 18+ 2018-07-14 2018-07-14 2
## 2 everyone 2017-09-30 2018-04-09 5777
## 3 everyone 10+ 2017-12-06 2018-06-02 221
## 4 mature 17+ 2018-01-29 2018-07-03 264
## 5 teen 2017-11-09 2018-05-24 714
## 6 unrated 2015-06-24 2015-06-24 1
# Chi-square test for independence
# Creating contingency table of Content Rating vs Update Year
content_update_table <- table(
data_final$Content.Rating,
year(data_final$Last.Updated)
)
chi_test <- chisq.test(content_update_table)
print("\nChi-square test results:")
## [1] "\nChi-square test results:"
print(chi_test)
##
## Pearson's Chi-squared test
##
## data: content_update_table
## X-squared = 65.18, df = 40, p-value = 0.00718
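The p-value (0.00718) is below 0.05, indicating a statistically significant association between Content Rating and the year of the last update. Some content ratings contain very few apps (2 "adults only 18+" and 1 "unrated"), so several expected counts are small and the chi-square approximation should be interpreted with caution.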
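When some cells of a contingency table have small expected counts, one option is to inspect the expected counts and compare the asymptotic p-value with a Monte Carlo p-value. The sketch below uses a made-up table whose two sparse rows mimic the rare content ratings. The counts are illustrative and are not taken from the dataset.

```r
# Made-up contingency table with two very sparse rows
toy_table <- matrix(
  c(  2,   0,   0,
    120, 300, 900,
     10,  25,  60,
      1,   0,   0),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Content.Rating = c("adults only 18+", "everyone", "teen", "unrated"),
    Year = c("2016", "2017", "2018")
  )
)

# Asymptotic test; R warns that the approximation may be incorrect, so silence it here
asymptotic <- suppressWarnings(chisq.test(toy_table))

# Number of cells with an expected count below 5
small_cells <- sum(asymptotic$expected < 5)

# Monte Carlo p-value that does not rely on the large-sample approximation
set.seed(123)
simulated <- chisq.test(toy_table, simulate.p.value = TRUE, B = 2000)

print(small_cells)
print(c(asymptotic = asymptotic$p.value, simulated = simulated$p.value))
```

Another common choice is to merge or drop the very rare categories before testing.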
# Content Rating Basic Analysis
print("Basic Content Rating Analysis:")
## [1] "Basic Content Rating Analysis:"
content_rating_counts <- table(data_final$Content.Rating)
print(content_rating_counts)
##
## adults only 18+ everyone everyone 10+ mature 17+ teen
## 2 5777 221 264 714
## unrated
## 1
# Basic bar plot for Content Rating
ggplot(data_final, aes(x = Content.Rating)) +
geom_bar(fill = "skyblue") +
geom_text(stat = "count", aes(label = ..count..), vjust = -0.5) +
labs(title = "Distribution of App Content Ratings",
x = "Content Rating",
y = "Number of Apps") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Calculate percentages
content_rating_percentages <- prop.table(content_rating_counts) * 100
print("\nContent Rating Percentages:")
## [1] "\nContent Rating Percentages:"
print(round(content_rating_percentages, 2))
##
## adults only 18+ everyone everyone 10+ mature 17+ teen
## 0.03 82.78 3.17 3.78 10.23
## unrated
## 0.01
# 1.2 Last Updated Basic Analysis
data_final$Last.Updated <- as.Date(data_final$Last.Updated, format = "%B %d, %Y")
print("\nLast Updated Summary Statistics:")
## [1] "\nLast Updated Summary Statistics:"
summary(data_final$Last.Updated)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2010-05-21" "2017-07-12" "2018-04-20" "2017-10-11" "2018-07-13" "2018-08-08"
"everyone" is the dominant content rating, covering 82.78% of all apps, while "adults only 18+" (0.03%) and "unrated" (0.01%) are the rarest categories.
# Time-based Analysis
data_final <- data_final %>%
mutate(
update_year = year(Last.Updated),
update_month = month(Last.Updated),
update_quarter = quarter(Last.Updated),
days_since_update = as.numeric(difftime(max(Last.Updated), Last.Updated, units = "days"))
)
# Monthly update pattern
monthly_updates <- data_final %>%
group_by(update_year, update_month) %>%
summarize(count = n()) %>%
mutate(date = as.Date(paste(update_year, update_month, "01", sep = "-")))
ggplot(monthly_updates, aes(x = date, y = count)) +
geom_line(color = "blue") +
geom_point() +
labs(title = "App Updates Over Time",
x = "Date",
y = "Number of Updates") +
theme_minimal()
# 2.2 Content Rating Distribution by Update Quarter
ggplot(data_final, aes(x = factor(update_quarter), fill = Content.Rating)) +
geom_bar(position = "dodge") +
labs(title = "Content Rating Distribution by Quarter",
x = "Quarter",
y = "Count") +
theme_minimal()
# 3.1 Update Frequency Analysis by Content Rating
update_patterns <- data_final %>%
group_by(Content.Rating) %>%
summarize(
avg_days_since_update = mean(days_since_update),
median_days_since_update = median(days_since_update),
sd_days_since_update = sd(days_since_update),
n_apps = n()
) %>%
arrange(avg_days_since_update)
print("\nUpdate Patterns by Content Rating:")
## [1] "\nUpdate Patterns by Content Rating:"
print(update_patterns)
## # A tibble: 6 × 5
## Content.Rating avg_days_since_update median_days_since_update
## <chr> <dbl> <dbl>
## 1 adults only 18+ 25 25
## 2 mature 17+ 191. 35.5
## 3 everyone 10+ 244. 67
## 4 teen 271. 76
## 5 everyone 312. 121
## 6 unrated 1141 1141
## # ℹ 2 more variables: sd_days_since_update <dbl>, n_apps <int>
# 3.2 Statistical Tests
# Chi-square test for independence
contingency_table <- table(data_final$Content.Rating, data_final$update_quarter)
chi_test <- chisq.test(contingency_table)
print("\nChi-square test for independence between Content Rating and Update Quarter:")
## [1] "\nChi-square test for independence between Content Rating and Update Quarter:"
print(chi_test)
##
## Pearson's Chi-squared test
##
## data: contingency_table
## X-squared = 54.655, df = 15, p-value = 2.041e-06
# 3.3 Advanced Visualization - Heatmap of Updates
update_heatmap_data <- data_final %>%
group_by(update_month, Content.Rating) %>%
summarize(count = n()) %>%
spread(Content.Rating, count)
# Convert to matrix for heatmap
update_matrix <- as.matrix(update_heatmap_data[,-1])
rownames(update_matrix) <- month.abb[update_heatmap_data$update_month]
# Create heatmap
heatmap(update_matrix,
Colv = NA,
Rowv = NA,
scale = "column",
col = colorRampPalette(c("white", "steelblue"))(50),
main = "Update Pattern Heatmap by Content Rating",
xlab = "Content Rating",
ylab = "Month")
# 3.4 Time Series Decomposition
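# Monthly update counts across all content ratings
# (monthly_updates is not filtered to the "everyone" category)
everyone_ts <- monthly_updates %>%
ungroup() %>% # Drop the grouping so that only the count column is kept
arrange(date) %>%
pull(count) %>%
ts(frequency = 12)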
decomposed <- decompose(everyone_ts)
plot(decomposed)
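`decompose` assumes a regularly spaced series, so months with no updates need to appear as explicit zeros instead of being absent. The sketch below shows one way to fill such gaps, using simulated monthly counts with two months removed. The object names and the Poisson counts are assumptions made for illustration.

```r
# Simulated monthly counts for three years, with two months missing
set.seed(3)
months <- seq(as.Date("2015-01-01"), as.Date("2017-12-01"), by = "month")
observed <- data.frame(date = months, count = rpois(36, lambda = 20))
observed <- observed[-c(5, 17), ]   # drop two months to mimic gaps

# Join against the full month sequence and set the missing months to zero
all_months <- data.frame(date = months)
filled <- merge(all_months, observed, by = "date", all.x = TRUE)
filled$count[is.na(filled$count)] <- 0

# Regular monthly series starting in January 2015
monthly_ts <- ts(filled$count, start = c(2015, 1), frequency = 12)

# Classical decomposition into trend, seasonal, and random components
parts <- decompose(monthly_ts)
print(round(parts$figure, 2))       # estimated seasonal effect per calendar month
```

Setting `start` also gives the plot a real calendar axis instead of an index.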
# 3.5 Update Velocity Analysis
update_velocity <- data_final %>%
group_by(Content.Rating) %>%
summarize(
update_velocity = n() / n_distinct(update_month),
total_apps = n()
) %>%
arrange(desc(update_velocity))
print("\nUpdate Velocity by Content Rating:")
## [1] "\nUpdate Velocity by Content Rating:"
print(update_velocity)
## # A tibble: 6 × 3
## Content.Rating update_velocity total_apps
## <chr> <dbl> <int>
## 1 everyone 481. 5777
## 2 teen 59.5 714
## 3 mature 17+ 22 264
## 4 everyone 10+ 18.4 221
## 5 adults only 18+ 2 2
## 6 unrated 1 1
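###Observation for Update Velocity Analysis: The update_velocity column is the number of apps in each content rating category divided by the number of distinct calendar months in which those apps were last updated. Because the dataset records only one Last.Updated date per app, this is not the number of updates per app. With all twelve months represented in the larger categories, the value is roughly total_apps / 12 (for example 5777 / 12 ≈ 481 for everyone), so the ranking mostly reflects how many apps each category contains.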
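To make the metric concrete: `update_velocity` is computed as `n() / n_distinct(update_month)`, so it divides the number of apps in a rating by the number of distinct months in which those apps were last updated. A base-R sketch on made-up data (the data frame `toy` is hypothetical):

```r
# Made-up data: one row per app, with the month of its last update
toy <- data.frame(
  Content.Rating = c("everyone", "everyone", "everyone", "everyone", "teen", "teen"),
  update_month   = c(1, 1, 7, 8, 7, 7)
)

# Apps per rating divided by distinct update months per rating
n_apps   <- tapply(toy$update_month, toy$Content.Rating, length)
n_months <- tapply(toy$update_month, toy$Content.Rating,
                   function(m) length(unique(m)))
velocity <- n_apps / n_months
```

Here `everyone` has 4 apps spread over 3 months (4/3) and `teen` has 2 apps in a single month (2), so a small category concentrated in one month can score higher than a larger one.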
# 1. Update Cycle Analysis
data_final <- data_final %>%
mutate(
Last.Updated = as.Date(Last.Updated, format = "%B %d, %Y"),
day_of_week = wday(Last.Updated, label = TRUE),
week_of_year = week(Last.Updated),
month_of_year = month(Last.Updated, label = TRUE),
season = case_when(
month_of_year %in% c("Dec", "Jan", "Feb") ~ "Winter",
month_of_year %in% c("Mar", "Apr", "May") ~ "Spring",
month_of_year %in% c("Jun", "Jul", "Aug") ~ "Summer",
TRUE ~ "Fall"
)
)
# Day of Week Update Pattern by Content Rating
dow_pattern <- data_final %>%
group_by(Content.Rating, day_of_week) %>%
summarise(count = n()) %>%
group_by(Content.Rating) %>%
mutate(percentage = count/sum(count) * 100)
ggplot(dow_pattern, aes(x = day_of_week, y = percentage, fill = Content.Rating)) +
geom_bar(stat = "identity", position = "dodge") +
facet_wrap(~Content.Rating) +
labs(title = "Update Day Preferences by Content Rating",
x = "Day of Week",
y = "Percentage of Updates") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
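# 2. Update Interval Analysis
# Note: each gap is measured between consecutive Last.Updated dates of different apps within a rating,
# not between successive updates of the same app (the dataset has one date per app)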
update_intervals <- data_final %>%
group_by(Content.Rating) %>%
arrange(Last.Updated) %>%
mutate(days_between_updates = as.numeric(Last.Updated - lag(Last.Updated))) %>%
summarise(
mean_interval = mean(days_between_updates, na.rm = TRUE),
median_interval = median(days_between_updates, na.rm = TRUE),
std_dev = sd(days_between_updates, na.rm = TRUE),
cv = std_dev / mean_interval * 100 # Coefficient of Variation
)
print("Update Interval Analysis:")
## [1] "Update Interval Analysis:"
print(update_intervals)
## # A tibble: 6 × 5
## Content.Rating mean_interval median_interval std_dev cv
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 adults only 18+ 20 20 NA NA
## 2 everyone 0.520 0 4.14 798.
## 3 everyone 10+ 12.2 1 61.9 509.
## 4 mature 17+ 8.16 1 28.0 343.
## 5 teen 3.42 0 17.9 522.
## 6 unrated NaN NA NA NA
# 3. Seasonal Update Intensity
seasonal_intensity <- data_final %>%
group_by(Content.Rating, season) %>%
summarise(
update_count = n(),
update_intensity = n() / n_distinct(Last.Updated)
) %>%
arrange(Content.Rating, desc(update_intensity))
# Visualization of seasonal patterns
ggplot(seasonal_intensity, aes(x = season, y = update_intensity, fill = Content.Rating)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Seasonal Update Intensity by Content Rating",
x = "Season",
y = "Update Intensity") +
theme_minimal()
# 4. Update Clustering Analysis
update_features <- data_final %>%
group_by(Content.Rating) %>%
summarise(
mean_week = mean(week_of_year),
std_week = sd(week_of_year),
update_frequency = n(),
weekend_ratio = sum(day_of_week %in% c("Sat", "Sun")) / n()
)
# Normalize the features
update_features_norm <- scale(update_features[,-1])
rownames(update_features_norm) <- update_features$Content.Rating
# Perform hierarchical clustering
update_clusters <- hclust(dist(update_features_norm))
plot(update_clusters, main = "Hierarchical Clustering of Content Ratings by Update Patterns")
# 6. Update Consistency Score
consistency_score <- data_final %>%
group_by(Content.Rating) %>%
summarise(
total_updates = n(),
unique_days = n_distinct(Last.Updated),
consistency_score = (total_updates / unique_days) *
(1 - sd(as.numeric(day_of_week)) / 7) # Normalized consistency metric
) %>%
arrange(desc(consistency_score))
print("\nUpdate Consistency Scores:")
## [1] "\nUpdate Consistency Scores:"
print(consistency_score)
## # A tibble: 6 × 4
## Content.Rating total_updates unique_days consistency_score
## <chr> <int> <int> <dbl>
## 1 everyone 5777 1198 3.64
## 2 teen 714 350 1.55
## 3 mature 17+ 264 141 1.40
## 4 everyone 10+ 221 141 1.21
## 5 adults only 18+ 2 2 0.899
## 6 unrated 1 1 NA
# Convert Last.Updated to numeric (days since reference date) if not already done
reference_date <- min(data_final$Last.Updated, na.rm = TRUE) # Reference date
data_final$Days.Since.Update <- as.numeric(data_final$Last.Updated - reference_date)
# Perform the Kolmogorov-Smirnov test on the numeric 'Days.Since.Update' values
content_ratings <- unique(data_final$Content.Rating)
ks_results <- data.frame(
rating1 = character(),
rating2 = character(),
p_value = numeric()
)
for (i in 1:(length(content_ratings)-1)) {
for (j in (i+1):length(content_ratings)) {
# Extract groups, removing NA values
group1 <- na.omit(data_final$Days.Since.Update[data_final$Content.Rating == content_ratings[i]])
group2 <- na.omit(data_final$Days.Since.Update[data_final$Content.Rating == content_ratings[j]])
# Check if both groups have enough data for comparison
if(length(group1) > 1 && length(group2) > 1) {
ks_test <- ks.test(group1, group2)
ks_results <- rbind(ks_results,
data.frame(rating1 = content_ratings[i],
rating2 = content_ratings[j],
p_value = ks_test$p.value))
}
}
}
print("\nKolmogorov-Smirnov Test Results:")
## [1] "\nKolmogorov-Smirnov Test Results:"
print(ks_results[ks_results$p_value < 0.05,])
## rating1 rating2 p_value
## 1 teen everyone 1.260985e-04
## 3 teen mature 17+ 3.424678e-05
## 5 everyone everyone 10+ 1.111626e-03
## 6 everyone mature 17+ 1.787413e-12
## 8 everyone 10+ mature 17+ 8.647877e-03
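The loop above runs one test per pair of ratings, so some small p-values would be expected by chance alone. A hedged sketch of a Bonferroni adjustment applied to the five printed p-values follows. The total of 10 tests is an assumption, derived from the five ratings that have more than one app (`choose(5, 2)`).

```r
# p-values copied from the printed Kolmogorov-Smirnov results
ks_p <- c(
  "teen vs everyone"           = 1.260985e-04,
  "teen vs mature 17+"         = 3.424678e-05,
  "everyone vs everyone 10+"   = 1.111626e-03,
  "everyone vs mature 17+"     = 1.787413e-12,
  "everyone 10+ vs mature 17+" = 8.647877e-03
)

# Assumption: 10 pairwise tests were run in total (5 ratings with > 1 app)
n_tests <- choose(5, 2)

# Bonferroni: multiply each p-value by the number of tests, capped at 1
ks_p_adj <- p.adjust(ks_p, method = "bonferroni", n = n_tests)
still_significant <- names(ks_p_adj)[ks_p_adj < 0.05]
```

Under this assumption four of the five pairs stay below 0.05, while the everyone 10+ versus mature 17+ comparison rises to about 0.086.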
# Prepare Last.Updated and Content.Rating for the install analysis
data_final <- data_final %>%
mutate(
# Convert Last.Updated to Date format if not already
Last.Updated = as.Date(Last.Updated, format = "%B %d, %Y"),
# Convert Content.Rating to factor
Content.Rating = as.factor(Content.Rating)
)
# 1. Basic statistics for Installs by Content Rating
installs_by_rating <- data_final %>%
group_by(Content.Rating) %>%
summarise(
mean_installs = mean(Installs, na.rm = TRUE),
median_installs = median(Installs, na.rm = TRUE),
total_installs = sum(Installs, na.rm = TRUE),
n_apps = n()
) %>%
arrange(desc(mean_installs))
print("Summary of Installs by Content Rating:")
## [1] "Summary of Installs by Content Rating:"
print(installs_by_rating)
## # A tibble: 6 × 5
## Content.Rating mean_installs median_installs total_installs n_apps
## <fct> <dbl> <dbl> <dbl> <int>
## 1 everyone 10+ 13142477. 1000000 2904487480 221
## 2 teen 6724015. 500000 4800946402 714
## 3 everyone 3727217. 10000 21532131197 5777
## 4 mature 17+ 3279775. 500000 865860638 264
## 5 adults only 18+ 750000 750000 1500000 2
## 6 unrated 500 500 500 1
# 2. Visualize distribution of installs by content rating
ggplot(data_final, aes(x = Content.Rating, y = log10(Installs))) +
geom_boxplot(fill = "lightblue") +
labs(title = "Distribution of App Installs by Content Rating",
x = "Content Rating",
y = "Log10(Number of Installs)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# 3. Timeline analysis: Average installs over time by content rating
installs_timeline <- data_final %>%
group_by(Content.Rating, Last.Updated) %>%
summarise(avg_installs = mean(Installs, na.rm = TRUE)) %>%
ungroup()
ggplot(installs_timeline, aes(x = Last.Updated, y = log10(avg_installs), color = Content.Rating)) +
geom_smooth(method = "loess", se = FALSE) +
labs(title = "Average App Installs Over Time by Content Rating",
x = "Last Updated Date",
y = "Log10(Average Installs)") +
theme_minimal() +
theme(legend.position = "bottom")
# Remove rows where Installs has NA, NaN, or non-positive values (because log10 of 0 or negative is undefined)
data_final <- data_final %>%
filter(!is.na(Installs) & Installs > 0)
# ANOVA test for difference in installs between content ratings
install_anova <- aov(log10(Installs) ~ Content.Rating, data = data_final)
print("\nANOVA test results for Installs by Content Rating:")
## [1] "\nANOVA test results for Installs by Content Rating:"
print(summary(install_anova))
## Df Sum Sq Mean Sq F value Pr(>F)
## Content.Rating 5 663 132.63 39.52 <2e-16 ***
## Residuals 6960 23361 3.36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# 5. Create time-based features for correlation analysis
data_analysis <- data_final %>%
mutate(
days_since_update = as.numeric(difftime(max(Last.Updated), Last.Updated, units = "days")),
update_year = year(Last.Updated),
update_month = month(Last.Updated)
)
# Calculate correlation between days since update and installs
correlation_result <- cor.test(data_analysis$days_since_update,
log10(data_analysis$Installs),
method = "spearman")
print("\nCorrelation between days since update and installs:")
## [1] "\nCorrelation between days since update and installs:"
print(correlation_result)
##
## Spearman's rank correlation rho
##
## data: data_analysis$days_since_update and log10(data_analysis$Installs)
## S = 7.2211e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.281761
# 6. Recent vs Old updates comparison
data_analysis <- data_analysis %>%
mutate(update_recency = ifelse(days_since_update <= median(days_since_update),
"Recent Update", "Old Update"))
recent_vs_old <- data_analysis %>%
group_by(Content.Rating, update_recency) %>%
summarise(
mean_installs = mean(Installs, na.rm = TRUE),
median_installs = median(Installs, na.rm = TRUE),
n_apps = n()
)
print("\nComparison of Installs by Update Recency and Content Rating:")
## [1] "\nComparison of Installs by Update Recency and Content Rating:"
print(recent_vs_old)
## # A tibble: 10 × 5
## # Groups: Content.Rating [6]
## Content.Rating update_recency mean_installs median_installs n_apps
## <fct> <chr> <dbl> <dbl> <int>
## 1 adults only 18+ Recent Update 750000 750000 2
## 2 everyone Old Update 1202045. 10000 2985
## 3 everyone Recent Update 6452365. 100000 2781
## 4 everyone 10+ Old Update 2717872. 100000 92
## 5 everyone 10+ Recent Update 20577080. 1000000 129
## 6 mature 17+ Old Update 1046779. 100000 88
## 7 mature 17+ Recent Update 4396273. 500000 176
## 8 teen Old Update 1200671. 50000 316
## 9 teen Recent Update 11165491. 1000000 396
## 10 unrated Old Update 500 500 1
# 7. Visualization of update recency effect
ggplot(data_analysis, aes(x = Content.Rating, y = log10(Installs), fill = update_recency)) +
geom_boxplot() +
labs(title = "Install Distribution by Content Rating and Update Recency",
x = "Content Rating",
y = "Log10(Number of Installs)",
fill = "Update Recency") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# 1. Encode content rating (e.g., as factor levels or one-hot encoding)
data_final$Content.Rating <- as.factor(data_final$Content.Rating)
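# 2. Create days since last update, measured from the newest update in the data
# (using Sys.Date() would make the values depend on the day the script is run)
data_final$days_since_update <- as.numeric(difftime(max(data_final$Last.Updated, na.rm = TRUE), data_final$Last.Updated, units = "days"))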
# 3. Calculate correlations
# Log-transform installs for better normalization
data_final$log_installs <- log10(data_final$Installs)
# Correlation between days since update and installs
correlation_update_installs <- cor.test(data_final$days_since_update, data_final$log_installs, method = "spearman")
# ANOVA for installs based on content rating
anova_content_rating <- aov(log_installs ~ Content.Rating, data = data_final)
# Print results
print(correlation_update_installs)
##
## Spearman's rank correlation rho
##
## data: data_final$days_since_update and data_final$log_installs
## S = 7.2211e+10, p-value < 2.2e-16
## alternative hypothesis: true rho is not equal to 0
## sample estimates:
## rho
## -0.281761
print(summary(anova_content_rating))
## Df Sum Sq Mean Sq F value Pr(>F)
## Content.Rating 5 663 132.63 39.52 <2e-16 ***
## Residuals 6960 23361 3.36
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A weak-to-moderate negative correlation (Spearman's ρ ≈ −0.28) was found between the number of days since the last update and the log-transformed installs. Apps that have gone longer without an update tend to have fewer installs. The relationship is statistically significant (p < 2.2e-16), which is consistent with timely updates helping to maintain user engagement, although a correlation alone does not show that updates cause installs.
The ANOVA revealed significant differences in log-transformed install counts across content ratings (F(5, 6960) = 39.52, p < 2e-16). Install levels therefore differ by content rating, which suggests that the audience an app targets is associated with how many users it attracts.
These findings suggest that regular updates are associated with higher install counts and that install counts vary with content rating. Keeping apps updated and choosing the target audience deliberately may help app performance and user acquisition, but both results are observational.
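Statistical significance with several thousand apps does not by itself indicate a large effect. As a rough check, eta squared can be computed from the sums of squares in the printed ANOVA table. The inputs below are the rounded values from that output, so the result is approximate.

```r
# Sums of squares as printed in the ANOVA table (rounded in the output)
ss_rating   <- 663
ss_residual <- 23361

# Eta squared: share of variance in log10(Installs) associated with Content.Rating
eta_sq <- ss_rating / (ss_rating + ss_residual)
round(eta_sq, 3)
```

On these figures content rating accounts for roughly 3% of the variance in log-transformed installs, so the group differences are real but modest.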